

Jeffrey S. Rosenthal 
Michael J. Evans 


Probability and Statistics: The Science of Uncertainty


Michael J. Evans and Jeffrey S. Rosenthal
University of Toronto


Contents

Preface

1 Probability Models
  1.1 Probability: A Measure of Uncertainty
    1.1.1 Why Do We Need Probability Theory?
  1.2 Probability Models
    1.2.1 Venn Diagrams and Subsets
  1.3 Properties of Probability Models
  1.4 Uniform Probability on Finite Spaces
    1.4.1 Combinatorial Principles
  1.5 Conditional Probability and Independence
    1.5.1 Conditional Probability
    1.5.2 Independence of Events
  1.6 Continuity of P
  1.7 Further Proofs (Advanced)

2 Random Variables and Distributions
  2.1
  2.2
  2.3 Discrete Distributions
    2.3.1 Important Discrete Distributions
  2.4 Continuous Distributions
    2.4.1 Important Absolutely Continuous Distributions
  2.5 Cumulative Distribution Functions
    2.5.1 Properties of Distribution Functions
    2.5.2 Cdfs of Discrete Distributions
    2.5.3 Cdfs of Absolutely Continuous Distributions
    2.5.4 Mixture Distributions
    2.5.5 Distributions Neither Discrete Nor Continuous (Advanced)
  2.6 One-Dimensional Change of Variable
    2.6.1 The Discrete Case
  2.7
    2.7.2 Marginal Distributions
    2.7.3 Joint Probability Functions
    2.7.4 Joint Density Functions
  2.8 Conditioning and Independence
    2.8.1 Conditioning on Discrete Random Variables
    2.8.2 Conditioning on Continuous Random Variables
    2.8.3 Independence of Random Variables
    2.8.4 Order Statistics
  2.9 Multidimensional Change of Variable
    2.9.1 The Discrete Case
    2.9.2 The Continuous Case (Advanced)
    2.9.3 Convolution
  2.10 Simulating Probability Distributions
    2.10.1 Simulating Discrete Distributions
    2.10.2 Simulating Continuous Distributions
  2.11 Further Proofs (Advanced)

3 Expectation
  3.1 The Discrete Case
  3.2 The Absolutely Continuous Case
  3.3 Variance, Covariance, and Correlation
  3.4 Generating Functions
    3.4.1 Characteristic Functions (Advanced)
  3.5 Conditional Expectation
    3.5.1 Discrete Case
    3.5.2 Absolutely Continuous Case
    3.5.3 Double Expectations
    3.5.4 Conditional Variance (Advanced)
  3.6 Inequalities
    3.6.1 Jensen’s Inequality (Advanced)
  3.7 General Expectations (Advanced)
  3.8 Further Proofs (Advanced)

4 Sampling Distributions and Limits
  4.1 Sampling Distributions
  4.2 Convergence in Probability
    4.2.1 The Weak Law of Large Numbers
  4.3 Convergence with Probability 1
    4.3.1 The Strong Law of Large Numbers
  4.4 Convergence in Distribution
    4.4.1 The Central Limit Theorem
    4.4.2 The Central Limit Theorem and Assessing Error
  4.5 Monte Carlo Approximations
  4.6 Normal Distribution Theory
    4.6.1 The Chi-Squared Distribution
    4.6.2 The t Distribution
    4.6.3 The F Distribution
  4.7 Further Proofs (Advanced)

5 Statistical Inference
  5.1 Why Do We Need Statistics?
  5.2 Inference Using a Probability Model
  5.3 Statistical Models
  5.4 Data Collection
    5.4.1 Finite Populations
    5.4.2 Simple Random Sampling
    5.4.3 Histograms
    5.4.4 Survey Sampling
  5.5 Some Basic Inferences
    5.5.1 Descriptive Statistics
    5.5.2 Plotting Data
    5.5.3 Types of Inferences

6 Likelihood Inference
  6.1
  6.2
    6.2.1 Computation of the MLE
    6.2.2 The Multidimensional Case (Advanced)
  6.3 Inferences Based on the MLE
    6.3.1 Standard Errors, Bias, and Consistency
    6.3.2 Confidence Intervals
    6.3.3 Testing Hypotheses and P-Values
    6.3.4 Inferences for the Variance
    6.3.5 Sample-Size Calculations: Confidence Intervals
    6.3.6 Sample-Size Calculations: Power
  6.4 Distribution-Free Methods
    6.4.1 Method of Moments
    6.4.2 Bootstrapping
    6.4.3 The Sign Statistic and Inferences about Quantiles
  6.5 Asymptotics for the MLE (Advanced)

7 Bayesian Inference
  7.1 The Prior and Posterior Distributions
  7.2 Inferences Based on the Posterior
    7.2.1 Estimation
    7.2.2 Credible Intervals
    7.2.3 Hypothesis Testing and Bayes Factors
    7.2.4 Prediction
  7.3 Bayesian Computations
    7.3.1 Asymptotic Normality of the Posterior
    7.3.2 Sampling from the Posterior
    7.3.3 Sampling from the Posterior Via Gibbs Sampling (Advanced)
  7.4 Choosing Priors
    7.4.1 Conjugate Priors
    7.4.2 Elicitation
    7.4.3 Empirical Bayes
    7.4.4 Hierarchical Bayes
    7.4.5 Improper Priors and Noninformativity
  7.5 Further Proofs (Advanced)

8 Optimal Inferences
  8.1 Optimal Unbiased Estimation
    8.1.1 The Rao–Blackwell Theorem and Rao–Blackwellization
    8.1.2 Completeness and the Lehmann–Scheffé Theorem
    8.1.3 The Cramér–Rao Inequality (Advanced)
  8.2 Optimal Hypothesis Testing
    8.2.1 The Power Function of a Test
    8.2.2 Type I and Type II Errors
    8.2.3 Rejection Regions and Test Functions
    8.2.4 The Neyman–Pearson Theorem
    8.2.5 Likelihood Ratio Tests (Advanced)
  8.3 Optimal Bayesian Inferences
  8.4 Decision Theory (Advanced)
  8.5 Further Proofs (Advanced)

9 Model Checking
  9.1 Checking the Sampling Model
    9.1.1 Residual and Probability Plots
    9.1.2 The Chi-Squared Goodness of Fit Test
    9.1.3 Prediction and Cross-Validation
  9.2 Checking for Prior–Data Conflict
  9.3 The Problem with Multiple Checks

10 Relationships Among Variables
  10.1 Related Variables
    10.1.1 The Definition of Relationship
    10.1.2 Cause–Effect Relationships and Experiments
    10.1.3 Design of Experiments
  10.2 Categorical Response and Predictors
    10.2.1 Random Predictor
    10.2.2 Deterministic Predictor
    10.2.3 Bayesian Formulation
  10.3 Quantitative Response and Predictors
    10.3.1 The Method of Least Squares
    10.3.2 The Simple Linear Regression Model
    10.3.3 Bayesian Simple Linear Model (Advanced)
    10.3.4 The Multiple Linear Regression Model (Advanced)
  10.4 Quantitative Response and Categorical Predictors
    10.4.1 One Categorical Predictor (One-Way ANOVA)
    10.4.2 Repeated Measures (Paired Comparisons)
    10.4.3 Two Categorical Predictors (Two-Way ANOVA)
    10.4.4 Randomized Blocks
    10.4.5 One Categorical and One Quantitative Predictor
  10.5 Categorical Response and Quantitative Predictors
  10.6 Further Proofs (Advanced)

11 Advanced Topic — Stochastic Processes
  11.1 Simple Random Walk
    11.1.1 The Distribution of the Fortune
    11.1.2 The Gambler’s Ruin Problem
  11.2 Markov Chains
    11.2.1 Examples of Markov Chains
    11.2.2 Computing with Markov Chains
    11.2.3 Stationary Distributions
    11.2.4 Markov Chain Limit Theorem
  11.3 Markov Chain Monte Carlo
    11.3.1 The Metropolis–Hastings Algorithm
    11.3.2 The Gibbs Sampler
  11.4 Martingales
    11.4.1 Definition of a Martingale
    11.4.2 Expected Values
    11.4.3 Stopping Times
  11.5 Brownian Motion
    11.5.1 Faster and Faster Random Walks
    11.5.2 Brownian Motion as a Limit
    11.5.3 Diffusions and Stock Prices
  11.6 Poisson Processes
  11.7 Further Proofs

Appendices

A Mathematical Background
  A.1 Derivatives
  A.2 Integrals
  A.3 Infinite Series
  A.4 Matrix Multiplication
  A.5 Partial Derivatives
  A.6 Multivariable Integrals

B Computations
  B.1 Using R
  B.2 Using Minitab

C Common Distributions

  D.1 Random Numbers
  D.2 Standard Normal Cdf
  D.3 Chi-Squared Distribution Quantiles
  D.4 t Distribution Quantiles
  D.5 F Distribution Quantiles
  D.6 Binomial Distribution Probabilities

E Answers to Odd-Numbered Exercises

Index


Preface 


This book is an introductory text on probability and statistics, targeting students who 
have studied one year of calculus at the university level and are seeking an introduction 
to probability and statistics with mathematical content. Where possible, we provide 
mathematical details, and it is expected that students are seeking to gain some mastery 
over these, as well as to learn how to conduct data analyses. All the usual method- 
ologies covered in a typical introductory course are introduced, as well as some of the 
theory that serves as their justification. 

The text can be used with or without a statistical computer package. It is our opin- 
ion that students should see the importance of various computational techniques in 
applications, and the book attempts to do this. Accordingly, we feel that computational 
aspects of the subject, such as Monte Carlo, should be covered, even if a statistical 
package is not used. Almost any statistical package is suitable. A Computations 
appendix provides an introduction to the R language. This covers all aspects of the lan- 
guage needed to do the computations in the text. Furthermore, we have provided the R 
code for any of the more complicated computations. Students can use these examples 
as templates for problems that involve such computations, e.g., using Gibbs sampling. 
Also, we have provided, in a separate section of this appendix, Minitab code for those 
computations that are slightly involved, e.g., Gibbs sampling. No programming expe- 
rience is required of students to do the problems. 

We have organized the exercises in the book into groups, as an aid to users. Exer- 
cises are suitable for all students and offer practice in applying the concepts discussed 
in a particular section. Problems require greater understanding, and a student can ex- 
pect to spend more thinking time on these. If a problem is marked (MV), then it will 
require some facility with multivariable calculus beyond the first calculus course, al- 
though these problems are not necessarily hard. Challenges are problems that most 
students will find difficult; these are only for students who have no trouble with the 
Exercises and the Problems. There are also Computer Exercises and Computer 
Problems, where it is expected that students will make use of a statistical package in 
deriving solutions. 

We have included a number of Discussion Topics designed to promote critical 
thinking in students. Throughout the book, we try to point students beyond the mastery 
of technicalities to think of the subject in a larger frame of reference. It is important that 
students acquire a sound mathematical foundation in the basic techniques of probability 
and statistics, which we believe this book will help students accomplish. Ultimately, 
however, these subjects are applied in real-world contexts, so it is equally important 
that students understand how to go about their application and understand what issues 
arise. Often, there are no right answers to Discussion Topics; their purpose is to get a
student thinking about the subject matter. If these were to be used for evaluation, then
they would be answered in essay format and graded on the maturity the student showed 
with respect to the issues involved. Discussion Topics are probably most suitable for 
smaller classes, but these will also benefit students who simply read them over and 
contemplate their relevance. 

Some sections of the book are labelled Advanced. This material is aimed at stu- 
dents who are more mathematically mature (for example, they are taking, or have taken, 
a second course in calculus). All the Advanced material can be skipped, with no loss 
of continuity, by an instructor who wishes to do so. In particular, the final chapter of the 
text is labelled Advanced and would only be taught in a high-level introductory course 
aimed at specialists. Also, many proofs appear in the final section of many chapters, 
labelled Further Proofs (Advanced). An instructor can choose which (if any) of these 
proofs they wish to present to their students. 

As such, we feel that the material in the text is presented in a flexible way that 
allows the instructor to find an appropriate level for the students they are teaching. A 
Mathematical Background appendix reviews some mathematical concepts, from a 
first course in calculus, in case students could use a refresher, as well as brief introduc- 
tions to partial derivatives, double integrals, etc. 

Chapter 1 introduces the probability model and provides motivation for the study 
of probability. The basic properties of a probability measure are developed. 

Chapter 2 deals with discrete, continuous, and joint distributions, and with the
effects of a change of variable. It also introduces the topic of simulating from a
probability distribution. The multivariate change of variable is developed in an
Advanced section.

Chapter 3 introduces expectation. The probability-generating function is dis- 
cussed, as are the moments and the moment-generating function of a random variable. 
This chapter develops some of the major inequalities used in probability. A section on 
characteristic functions is included as an Advanced topic. 

Chapter 4 deals with sampling distributions and limits. Convergence in probabil- 
ity, convergence with probability 1, the weak and strong laws of large numbers, con- 
vergence in distribution, and the central limit theorem are all introduced, along with 
various applications such as Monte Carlo. The normal distribution theory, necessary 
for many statistical applications, is also dealt with here. 

As mentioned, Chapters 1 through 4 include material on Monte Carlo techniques. 
Simulation is a key aspect of the application of probability theory, and it is our view 
that its teaching should be integrated with the theory right from the start. This reveals 
the power of probability to solve real-world problems and helps convince students that 
it is far more than just an interesting mathematical theory. No practitioner divorces 
himself from the theory when using the computer for computations or vice versa. We 
believe this is a more modern way of teaching the subject. This material can be skipped,
however, if an instructor believes otherwise or feels there is not enough time to cover 
it effectively. 

Chapter 5 is an introduction to statistical inference. For the most part, this is con- 
cerned with laying the groundwork for the development of more formal methodology 
in later chapters. So practical issues — such as proper data collection, presenting data 
via graphical techniques, and informal inference methods like descriptive statistics — 
are discussed here. 




Chapter 6 deals with many of the standard methods of inference for one-sample 
problems. The theoretical justification for these methods is developed primarily through 
the likelihood function, but the treatment is still fairly informal. Basic methods of in- 
ference, such as the standard error of an estimate, confidence intervals, and P-values, 
are introduced. There is also a section devoted to distribution-free (nonparametric) 
methods like the bootstrap. 

Chapter 7 involves many of the same problems discussed in Chapter 6, but now 
from a Bayesian perspective. The point of view adopted here is not that Bayesian meth- 
ods are better or, for that matter, worse than those of Chapter 6. Rather, we take the 
view that Bayesian methods arise naturally when the statistician adds another ingredi- 
ent — the prior — to the model. The appropriateness of this, or the sampling model 
for the data, is resolved through the model-checking methods of Chapter 9. It is not 
our intention to have students adopt a particular philosophy. Rather, the text introduces 
students to a broad spectrum of statistical thinking. 

Subsequent chapters deal with both frequentist and Bayesian approaches to the 
various problems discussed. The Bayesian material is in clearly labelled sections and 
can be skipped with no loss of continuity, if so desired. It has become apparent in 
recent years, however, that Bayesian methodology is widely used in applications. As 
such, we feel that it is important for students to be exposed to this, as well as to the 
frequentist approaches, early in their statistical education. 

Chapter 8 deals with the traditional optimality justifications offered for some sta- 
tistical inferences. In particular, some aspects of optimal unbiased estimation and the 
Neyman–Pearson theorem are discussed. There is also a brief introduction to decision
theory. This chapter is more formal and mathematical than Chapters 5, 6, and 7, and it 
can be skipped, with no loss of continuity, if an instructor wants to emphasize methods 
and applications. 

Chapter 9 is on model checking. We placed model checking in a separate chapter 
to emphasize its importance in applications. In practice, model checking is the way 
statisticians justify the choices they make in selecting the ingredients of a statistical
problem. While these choices are inherently subjective, the methods of this chapter 
provide checks to make sure that the choices made are sensible in light of the objective 
observed data. 

Chapter 10 is concerned with the statistical analysis of relationships among vari- 
ables. This includes material on simple linear and multiple regression, ANOVA, the 
design of experiments, and contingency tables. The emphasis in this chapter is on 
applications. 

Chapter 11 is concerned with stochastic processes. In particular, Markov chains 
and Markov chain Monte Carlo are covered in this chapter, as are Brownian motion and 
its relevance to finance. Fairly sophisticated topics are introduced, but the treatment is 
entirely elementary. Chapter 11 depends only on the material in Chapters 1 through 4.

A one-semester course on probability would cover Chapters 1–4 and perhaps some
of Chapter 11. A one-semester, follow-up course on statistics would cover Chapters 5–7
and 9–10. Chapter 8 is not necessary, but some parts, such as the theory of unbiased
estimation and optimal testing, are suitable for a more theoretical course.

A basic two-semester course in probability and statistics would cover Chapters 1–6
and 9–10. Such a course covers all the traditional topics, including basic probability
theory, basic statistical inference concepts, and the usual introductory applied statistics
topics. To cover the entire book would take three semesters, which could be organized 
in a variety of ways. 

The Advanced sections can be skipped or included, depending on the level of the 
students, with no loss of continuity. A similar approach applies to Chapters 7, 8, and 
11. 

Students who have already taken an introductory, noncalculus-based, applied sta- 
tistics course will also benefit from a course based on this text. While similar topics are 
covered, they are presented with more depth and rigor here. For example, Introduction 
to the Practice of Statistics, 6th ed., by D. Moore and G. McCabe (W. H. Freeman, 
2009) is an excellent text, and we believe that our book would serve as a strong basis 
for a follow-up course. 

There is an Instructor’s Solutions Manual available from the publisher. 

The second edition contains many more basic exercises than the first edition. Also, 
we have rewritten a number of sections, with the aim of making the material clearer to 
students. One goal in our rewriting was to subdivide the material into smaller, more 
digestible components so that key ideas stand out more boldly. There has been a com- 
plete typographical redesign that we feel aids in this as well. In the appendices, we have 
added material on the statistical package R as well as answers for the odd-numbered 
exercises that students can use to check their understanding. 

Many thanks to the reviewers and users for their comments: Abbas Alhakim (Clark- 
son University), Michelle Baillargeon (McMaster University), Arne C. Bathke (Univer- 
sity of Kentucky), Lisa A. Bloomer (Middle Tennessee State University), Christopher 
Brown (California Lutheran University), Jem N. Corcoran (University of Colorado), 
Guang Cheng (Purdue University), Yi Cheng (Indiana University South Bend), Eugene 
Demidenko (Dartmouth College), Robert P. Dobrow (Carleton College), John Ferdi- 
nands (Calvin College), Soledad A. Fernandez (The Ohio State University), Paramjit 
Gill (University of British Columbia Okanagan), Marvin Glover (Milligan College), 
Ellen Gundlach (Purdue University), Paul Gustafson (University of British Columbia), 
Jan Hannig (Colorado State University), Solomon W. Harrar (The University of Mon- 
tana), Susan Herring (Sonoma State University), George F. Hilton (Pacific Union Col- 
lege), Chun Jin (Central Connecticut State University), Paul Joyce (University of Idaho), 
Hubert Lilliefors (George Washington University), Andy R. Magid (University of Ok- 
lahoma), Phil McDonnough (University of Toronto), Julia Morton (Nipissing Univer- 
sity), Jean D. Opsomer (Colorado State University), Randall H. Rieger (West Chester 
University), Robert L. Schaefer (Miami University), Osnat Stramer (University of 
Iowa), Tim B. Swartz (Simon Fraser University), Glen Takahara (Queen’s University), 
Robert D. Thompson (Hunter College), David C. Vaughan (Wilfrid Laurier University), 
Joseph J. Walker (Georgia State University), Chad Westerland (University of Arizona), 
Dongfeng Wu (Mississippi State University), Yuehua Wu (York University), Nicholas 
Zaino (University of Rochester). In particular, Professor Chris Andrews (State Univer- 
sity of New York) provided many corrections to the first edition. 

The authors would also like to thank many who have assisted in the development 
of this project. In particular, our colleagues and students at the University of Toronto 
have been very supportive. Ping Gao, Aysha Hashim, Gun Ho Jang, Hadas Moshonov, 
and Mahinda Samarakoon helped in many ways. A number of the data sets in Chapter
10 have been used in courses at the University of Toronto for many years and were, we
believe, compiled through the work of the late Professor Daniel B. DeLury. Professor 
David Moore of Purdue University was of assistance in providing several of the tables 
at the back of the text. Patrick Farace, Anne Scanlan-Rohrer, Chris Spavins, Danielle 
Swearengin, Brian Tedesco, Vivien Weiss, and Katrina Wilhelm of W. H. Freeman 
provided much support and encouragement. Our families helped us with their patience 
and care while we worked at what seemed at times an unending task; many thanks to 
Rosemary and Heather Evans and Margaret Fulford. 


Michael Evans and Jeffrey Rosenthal 
Toronto, 2009 


We have made the book freely available on the web for several years now. This 
version corrects a number of issues pointed out to us by users. We are grateful to all
those who have brought these to our attention (see the older webpage for individual 
acknowledgements) and we will continue to make corrections as we are so informed. 


Michael Evans and Jeffrey Rosenthal 
Toronto, 2023 


Chapter 1 
Probability Models 


CHAPTER OUTLINE 


Section 1 Probability: A Measure of Uncertainty
Section 2 Probability Models
Section 3 Properties of Probability Models
Section 4 Uniform Probability on Finite Spaces
Section 5 Conditional Probability and Independence
Section 6 Continuity of P
Section 7 Further Proofs (Advanced)


This chapter introduces the basic concept of the entire course, namely, probability. We 
discuss why probability was introduced as a scientific concept and how it has been 
formalized mathematically in terms of a probability model. Following this, we develop
some of the basic mathematical results associated with the probability model. 


1.1 Probability: A Measure of Uncertainty


Often in life we are confronted by our own ignorance. Whether we are pondering 
tonight’s traffic jam, tomorrow’s weather, next week’s stock prices, an upcoming elec- 
tion, or where we left our hat, often we do not know an outcome with certainty. Instead, 
we are forced to guess, to estimate, to hedge our bets. 

Probability is the science of uncertainty. It provides precise mathematical rules for 
understanding and analyzing our own ignorance. It does not tell us tomorrow’s weather 
or next week’s stock prices; rather, it gives us a framework for working with our limited 
knowledge and for making sensible decisions based on what we do and do not know. 

To say there is a 40% chance of rain tomorrow is not to know tomorrow’s weather. 
Rather, it is to know what we do not know about tomorrow’s weather. 

In this text, we will develop a more precise understanding of what it means to say 
there is a 40% chance of rain tomorrow. We will learn how to work with ideas of 
randomness, probability, expected value, prediction, estimation, etc., in ways that are 
sensible and mathematically clear. 




There are also other sources of randomness besides uncertainty. For example, com- 
puters often use pseudorandom numbers to make games fun, simulations accurate, and 
searches efficient. Also, according to the modern theory of quantum mechanics, the 
makeup of atomic matter is in some sense truly random. All such sources of random- 
ness can be studied using the techniques of this text. 

Another way of thinking about probability is in terms of relative frequency. For ex- 
ample, to say a coin has a 50% chance of coming up heads can be interpreted as saying 
that, if we flipped the coin many, many times, then approximately half of the time it 
would come up heads. This interpretation has some limitations. In many cases (such 
as tomorrow’s weather or next week’s stock prices), it is impossible to repeat the ex- 
periment many, many times. Furthermore, what precisely does “approximately” mean 
in this case? However, despite these limitations, the relative frequency interpretation is 
a useful way to think of probabilities and to develop intuition about them. 
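
The relative frequency interpretation can be explored with a small simulation. The following is only a sketch, written in Python rather than in the R or Minitab used in the appendices; the function name and the seed are our own illustrative choices, and the flips are pseudorandom, not truly random.

```python
import random

def relative_frequency_of_heads(num_flips, seed=0):
    """Simulate fair coin flips and return the fraction that came up heads."""
    rng = random.Random(seed)  # seeded pseudorandom generator, for reproducibility
    # Each flip is heads when a uniform draw on [0, 1) falls below 0.5.
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# The relative frequency tends to settle near 0.5 as the number of flips grows.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With only 10 flips the fraction of heads can be far from 0.5; with 100,000 flips it is typically very close. How close is "approximately" is made precise by the laws of large numbers in Chapter 4.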

Uncertainty has been with us forever, of course, but the mathematical theory of 
probability originated in the seventeenth century. In 1654, the Paris gambler Le Cheva- 
lier de Méré asked Blaise Pascal about certain probabilities that arose in gambling 
(such as, if a game of chance is interrupted in the middle, what is the probability that 
each player would have won had the game continued?). Pascal was intrigued and
corresponded with the great mathematician and lawyer Pierre de Fermat about these
questions. Pascal later wrote the book Traité du triangle arithmétique, discussing binomial
coefficients (Pascal’s triangle) and the binomial probability distribution. 

At the beginning of the twentieth century, Russians such as Andrei Andreyevich 
Markov, Andrey Nikolayevich Kolmogorov, and Pafnuty L. Chebychev (and Ameri- 
can Norbert Wiener) developed a more formal mathematical theory of probability. In 
the 1950s, Americans William Feller and Joe Doob wrote important books about the 
mathematics of probability theory. They popularized the subject in the western world, 
both as an important area of pure mathematics and as having important applications in 
physics, chemistry, and later in computer science, economics, and finance. 


1.1.1 Why Do We Need Probability Theory?


Probability theory comes up very often in our daily lives. We offer a few examples 
here. 

Suppose you are considering buying a “Lotto 6/49” lottery ticket. In this lottery, 
you are to pick six distinct integers between 1 and 49. Another six distinct integers
between 1 and 49 are then selected at random by the lottery company. If the two sets
of six integers are identical, then you win the jackpot. 

After mastering Section 1.4, you will know how to calculate that the probability 
of the two sets matching is equal to one chance in 13,983,816. That is, it is about 14 
million times more likely that you will not win the jackpot than that you will. (These 
are not very good odds!) 
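
The figure of 13,983,816 can be checked directly. The sketch below assumes that every set of six distinct integers is equally likely to be drawn, which is the uniform model treated in Section 1.4; the variable names are ours.

```python
from math import comb

# Number of ways to choose 6 distinct integers from 1..49, ignoring order.
num_tickets = comb(49, 6)

# Under the equally-likely assumption, exactly one of these sets wins.
p_jackpot = 1 / num_tickets

print(num_tickets)  # 13983816
print(p_jackpot)
```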

Suppose the lottery tickets cost $1 each. After mastering expected values in Chap- 
ter 3, you will know that you should not even consider buying a lottery ticket unless the 
jackpot is more than $14 million (which it usually is not). Furthermore, if the jackpot 
is ever more than $14 million, then likely many other people will buy lottery tickets
that week, leading to a larger probability that you will have to share the jackpot with
other winners even if you do win — so it is probably not in your favor to buy a lottery 
ticket even then. 

Suppose instead that a “friend” offers you a bet. He has three cards, one red on 
both sides, one black on both sides, and one red on one side and black on the other. 
He mixes the three cards in a hat, picks one at random, and places it flat on the table 
with only one side showing. Suppose that one side is red. He then offers to bet his $4 
against your $3 that the other side of the card is also red. 

At first you might think it sounds like the probability that the other side is also red is 
50%; thus, a good bet. However, after mastering conditional probability (Section 1.5), 
you will know that, conditional on one side being red, the conditional probability that 
the other side is also red is equal to 2/3. So, by the theory of expected values (Chap- 
ter 3), you will know that you should not accept your “friend’s” bet. 
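
The 2/3 figure can be checked by listing the equally likely ways a face can end up showing. The sketch below is only an illustration: it assumes that the card and the side placed face up are both chosen uniformly at random, and it treats the bet from your point of view (you win $4 if the hidden side is black and lose $3 if it is red).

```python
from fractions import Fraction

# Each card is a pair of face colours.
cards = [("red", "red"), ("black", "black"), ("red", "black")]

# Six equally likely (shown, hidden) pairs: either side of each card may be up.
faces = []
for a, b in cards:
    faces.append((a, b))  # side a showing, side b hidden
    faces.append((b, a))  # side b showing, side a hidden

# Condition on the event that the side showing is red.
shown_red = [f for f in faces if f[0] == "red"]
both_red = [f for f in shown_red if f[1] == "red"]
p_other_red = Fraction(len(both_red), len(shown_red))

# Your expected gain in dollars from accepting the bet.
expected_gain = (1 - p_other_red) * 4 - p_other_red * 3

print(p_other_red, expected_gain)
```

The expected gain is negative, which is the sense in which the bet should be declined.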

Finally, suppose he suggests that you flip a coin one thousand times. Your “friend” 
says that if the coin comes up heads at least six hundred times, then he will pay you 
$100; otherwise, you have to pay him just $1. 

At first you might think that, while 500 heads is the most likely, there is still a 
reasonable chance that 600 heads will appear — at least good enough to justify accept- 
ing your friend’s $100 to $1 bet. However, after mastering the laws of large numbers 
(Chapter 4), you will know that as the number of coin flips gets large, it becomes more 
and more likely that the number of heads is very close to half of the total number of 
coin flips. In fact, in this case, there is less than one chance in ten billion of getting 
more than 600 heads! Therefore, you should not accept this bet, either. 
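
The size of this probability can be computed exactly rather than estimated. The following sketch assumes a fair coin, so that all 2^1000 sequences of flips are equally likely, and counts the sequences with enough heads using binomial coefficients (introduced in Section 1.4). It is a numerical check, not the law-of-large-numbers argument of Chapter 4.

```python
from fractions import Fraction
from math import comb

n = 1000

# Number of sequences of n flips with at least 600 heads,
# and with strictly more than 600 heads.
at_least_600 = sum(comb(n, k) for k in range(600, n + 1))
more_than_600 = at_least_600 - comb(n, 600)

# Divide by the 2**n equally likely sequences; Fractions keep this exact.
p_at_least_600 = Fraction(at_least_600, 2**n)
p_more_than_600 = Fraction(more_than_600, 2**n)

print(float(p_at_least_600), float(p_more_than_600))
```

Both values are on the order of one chance in ten billion, so the $100 prize does not come close to compensating for the $1 stake.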

As these examples show, a good understanding of probability theory will allow you 
to correctly assess probabilities in everyday situations, which will in turn allow you to 
make wiser decisions. It might even save you money! 

Probability theory also plays a key role in many important applications of science 
and technology. For example, the design of a nuclear reactor must be such that the 
escape of radioactivity into the environment is an extremely rare event. Of course, we 
would like to say that it is categorically impossible for this to ever happen, but reac- 
tors are complicated systems, built up from many interconnected subsystems, each of 
which we know will fail to function properly at some time. Furthermore, we can never 
definitely say that a natural event like an earthquake cannot occur that would damage 
the reactor sufficiently to allow an emission. The best we can do is try to quantify our 
uncertainty concerning the failures of reactor components or the occurrence of natural 
events that would lead to such an event. This is where probability enters the picture. 
Using probability as a tool to deal with the uncertainties, the reactor can be designed to 
ensure that an unacceptable emission has an extremely small probability — say, once 
in a billion years — of occurring. 

The gambling and nuclear reactor examples deal essentially with the concept of 
risk — the risk of losing money, the risk of being exposed to an injurious level of 
radioactivity, etc. In fact, we are exposed to risk all the time. When we ride in a car, 
or take an airplane flight, or even walk down the street, we are exposed to risk. We 
know that the risk of injury in such circumstances is never zero, yet we still engage in 
these activities. This is because we intuitively realize that the probability of an accident 
occurring is extremely low. 


4 Section 1.2: Probability Models 


So we are using probability every day in our lives to assess risk. As the problems 
we face, individually or collectively, become more complicated, we need to refine and 
develop our rough, intuitive ideas about probability to form a clear and precise ap- 
proach. This is why probability theory has been developed as a subject. In fact, the 
insurance industry has been developed to help us cope with risk. Probability is the 
tool used to determine what you pay to reduce your risk or to compensate you or your 
family in case of a personal injury. 


Summary of Section 1.1 


• Probability theory provides us with a precise understanding of uncertainty.


• This understanding can help us make predictions, make better decisions, assess
risk, and even make money.


DISCUSSION TOPICS 


1.1.1 Do you think that tomorrow’s weather and next week’s stock prices are “really” 
random, or is this just a convenient way to discuss and analyze them? 

1.1.2 Do you think it is possible for probabilities to depend on who is observing them, 
or at what time? 

1.1.3 Do you find it surprising that probability theory was not discussed as a mathe- 
matical subject until the seventeenth century? Why or why not? 

1.1.4 In what ways is probability important for such subjects as physics, computer 
science, and finance? Explain. 

1.1.5 What are examples from your own life where thinking about probabilities did 
save — or could have saved — you money or helped you to make a better decision? 
(List as many as you can.) 

1.1.6 Probabilities are often depicted in popular movies and television programs. List 
as many examples as you can. Do you think the probabilities were portrayed there in a 
“reasonable” way? 


1.2 | Probability Models 


A formal definition of probability begins with a sample space, often written S. This 
sample space is any set that lists all possible outcomes (or, responses) of some unknown 
experiment or situation. For example, perhaps 


S = {rain, snow, clear} 


when predicting tomorrow’s weather. Or perhaps S is the set of all positive real num- 
bers, when predicting next week’s stock price. The point is, S can be any set at all, 
even an infinite set. We usually write s for an element of S, so that $s \in S$. Note that S
describes only those things that we are interested in; if we are studying weather, then 
rain and snow are in S, but tomorrow’s stock prices are not. 




A probability model also requires a collection of events, which are subsets of S 
to which probabilities can be assigned. For the above weather example, the subsets 
{rain}, {snow}, {rain, snow}, {rain, clear}, {rain, snow, clear}, and even the empty set 
Ø = { }, are all examples of subsets of S that could be events. Note that here the comma 
means “or”; thus, {rain, snow} is the event that it will rain or snow. We will generally 
assume that all subsets of S are events. (In fact, in complicated situations there are 
some technical restrictions on what subsets can or cannot be events, according to the 
mathematical subject of measure theory. But we will not concern ourselves with such 
technicalities here.) 

Finally, and most importantly, a probability model requires a probability measure, 
usually written P. This probability measure must assign, to each event A, a probability 
P(A). We require the following properties: 


1. P(A) is always a nonnegative real number, between 0 and 1 inclusive.
2. P(Ø) = 0, i.e., if A is the empty set Ø, then P(A) = 0.
3. P(S) = 1, i.e., if A is the entire sample space S, then P(A) = 1.
4. P is (countably) additive, meaning that if $A_1, A_2, \ldots$ is a finite or countable
sequence of disjoint events, then

\[
P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots. \tag{1.2.1}
\]


The first of these properties says that we shall measure all probabilities on a scale 
from 0 to 1, where 0 means impossible and 1 (or 100%) means certain. The second
property says the probability that nothing happens is 0; in other words, it is impossible 
that no outcome will occur. The third property says the probability that something 
happens is 1; in other words, it is certain that some outcome must occur. 

The fourth property is the most subtle. It says that we can calculate probabilities 
of complicated events by adding up the probabilities of smaller events, provided those 
smaller events are disjoint and together contain the entire complicated event. Note that 
events are disjoint if they contain no outcomes in common. For example, {rain} and 
{snow, clear} are disjoint, whereas {rain} and {rain, clear} are not disjoint. (We are 
assuming for simplicity that it cannot both rain and snow tomorrow.) Thus, we should 
have P({rain}) + P({snow, clear}) = P({rain, snow, clear}), but do not expect to 
have P({rain}) + P({rain, clear}) = P({rain, rain, clear}) (the latter being the same 
as P({rain, clear})). 

We now formalize the definition of a probability model. 


Definition 1.2.1 A probability model consists of a nonempty set called the sample
space S; a collection of events that are subsets of S; and a probability measure P
assigning a probability between 0 and 1 to each event, with P(Ø) = 0 and P(S) = 1
and with P additive as in (1.2.1).




EXAMPLE 1.2.1 
Consider again the weather example, with S = {rain, snow, clear}. Suppose that the 
probability of rain is 40%, the probability of snow is 15%, and the probability of a 
clear day is 45%. We can express this as P({rain}) = 0.40, P({snow}) = 0.15, and 
P({clear}) = 0.45. 

For this example, of course P (Ø) = 0, i.e., it is impossible that nothing will happen 
tomorrow. Also P({rain, snow, clear}) = 1, because we are assuming that exactly 
one of rain, snow, or clear must occur tomorrow. (To be more realistic, we might say 
that we are predicting the weather at exactly 11:00 A.M. tomorrow.) Now, what is the 
probability that it will rain or snow tomorrow? Well, by the additivity property, we see 
that 

P({rain, snow}) = P({rain}) + P({snow}) = 0.40 + 0.15 = 0.55. 


We thus conclude that, as expected, there is a 55% chance of rain or snow tomorrow. ■
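
For a finite sample space, a probability model of this kind can be written down directly. The sketch below is one possible encoding, not a prescribed one: it stores the outcome probabilities of Example 1.2.1 in a dictionary and computes P(A) for an event A by additivity over the outcomes in A. The names `p` and `prob` are illustrative.

```python
from fractions import Fraction

# Outcome probabilities from Example 1.2.1 (exact fractions avoid rounding).
p = {
    "rain": Fraction(40, 100),
    "snow": Fraction(15, 100),
    "clear": Fraction(45, 100),
}

def prob(event):
    """P(event) for a set of outcomes, by additivity over single outcomes."""
    return sum((p[s] for s in event), Fraction(0))

assert prob(set()) == 0                      # P(empty set) = 0
assert prob({"rain", "snow", "clear"}) == 1  # P(S) = 1

rain_or_snow = prob({"rain", "snow"})
print(float(rain_or_snow))  # 0.55
```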


EXAMPLE 1.2.2 

Suppose your candidate has a 60% chance of winning an election in progress. Then
S = {win, lose}, with P(win) = 0.6 and P(lose) = 0.4. Note that P(win) + P(lose) =
1. ■


EXAMPLE 1.2.3 

Suppose we flip a fair coin, which can come up either heads (H) or tails (T) with equal
probability. Then S = {H, T}, with P(H) = P(T) = 0.5. Of course, P(H) + P(T) =
1. ■


EXAMPLE 1.2.4 
Suppose we flip three fair coins in a row and keep track of the sequence of heads and 
tails that result. Then 


S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.


Furthermore, each of these eight outcomes is equally likely. Thus, P(HHH) = 1/8,
P(TTT) = 1/8, etc. Also, the probability that the first coin is heads and the second
coin is tails, but the third coin can be anything, is equal to the sum of the probabilities
of the events HTH and HTT, i.e., P(HTH) + P(HTT) = 1/8 + 1/8 = 1/4. ■


EXAMPLE 1.2.5 
Suppose we flip three fair coins in a row but care only about the number of heads
that result. Then S = {0, 1, 2, 3}. However, these four outcomes
are not all equally likely; we will see later that in fact P(0) = P(3) = 1/8, while
P(1) = P(2) = 3/8. ■
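
The probabilities for the number of heads can be checked by brute-force enumeration. A minimal sketch, assuming the eight sequences of Example 1.2.4 are equally likely; the names `sequences` and `heads_dist` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The eight equally likely sequences of three flips.
sequences = ["".join(seq) for seq in product("HT", repeat=3)]

# Group the sequences by their number of heads.
counts = Counter(seq.count("H") for seq in sequences)
heads_dist = {k: Fraction(c, len(sequences)) for k, c in sorted(counts.items())}

for k, prob_k in heads_dist.items():
    print(k, prob_k)
```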

We note that it is possible to define probability models on more complicated (e.g., 
uncountably infinite) sample spaces as well. 


EXAMPLE 1.2.6 
Suppose that S = [0, 1] is the unit interval. We can define a probability measure P on
S by saying that

\[
P([a, b]) = b - a, \quad \text{whenever } 0 \le a \le b \le 1. \tag{1.2.2}
\]

In words, for any¹ subinterval [a, b] of [0, 1], the probability of the interval is simply
the length of that interval. This example is called the uniform distribution on [0, 1].
The uniform distribution is just the first of many distributions on uncountable state
spaces. Many further examples will be given in Chapter 2. ■


1.2.1 | Venn Diagrams and Subsets


Venn diagrams provide a very useful graphical method for depicting the sample space
S and subsets of it. For example, in Figure 1.2.1 we have a Venn diagram showing the
subset $A \subset S$ and the complement

\[
A^c = \{s : s \notin A\}
\]

of A. The rectangle denotes the entire sample space S. The circle (and its interior) denotes
the subset A; the region outside the circle, but inside S, denotes $A^c$.


Figure 1.2.1: Venn diagram of the subsets A and $A^c$ of the sample space S.


Two subsets $A \subset S$ and $B \subset S$ are depicted as two circles, as in Figure 1.2.2.
The intersection

\[
A \cap B = \{s : s \in A \text{ and } s \in B\}
\]

of the subsets A and B is the set of elements common to both sets and is depicted by
the region where the two circles overlap. The set

\[
A \cap B^c = \{s : s \in A \text{ and } s \notin B\}
\]

is called the complement of B in A and is depicted as the region inside the A circle,
but not inside the B circle. This is the set of elements in A but not in B. Similarly, we
have the complement of A in B, namely, $A^c \cap B$. Observe that the sets $A \cap B$, $A \cap B^c$,
and $A^c \cap B$ are mutually disjoint.


¹For the uniform distribution on [0, 1], it turns out that not all subsets of [0, 1] can properly be regarded
as events for this model. However, this is merely a technical property, and any subset that we can explicitly 
write down will always be an event. See more advanced probability books, e.g., page 3 of A First Look at 
Rigorous Probability Theory, Second Edition, by J. S. Rosenthal (World Scientific Publishing, Singapore, 
2006). 




The union

\[
A \cup B = \{s : s \in A \text{ or } s \in B\}
\]

of the sets A and B is the set of elements that are in either A or B. In Figure 1.2.2, it
is depicted by the region covered by the two circles together. Notice that
$A \cup B = (A \cap B^c) \cup (A \cap B) \cup (A^c \cap B)$.

There is one further region in Figure 1.2.2. This is the complement of $A \cup B$,
namely, the set of elements that are in neither A nor B. So we immediately have

\[
(A \cup B)^c = A^c \cap B^c.
\]

Similarly, we can show that

\[
(A \cap B)^c = A^c \cup B^c,
\]

namely, the subset of elements that are not in both A and B is given by the set of elements
not in A or not in B.
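
These set identities are easy to test on a small concrete case. The sketch below uses Python's built-in sets with a hypothetical sample space {1, ..., 10} and two arbitrarily chosen subsets; it checks the identities for this one example only and is not a proof.

```python
S = set(range(1, 11))  # hypothetical sample space {1, ..., 10}
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7}

def complement(E):
    """Complement of E relative to the sample space S."""
    return S - E

# The three mutually disjoint pieces whose union is A ∪ B.
pieces = [A & complement(B), A & B, complement(A) & B]
union_of_pieces = set().union(*pieces)

# The two complement identities stated in the text.
de_morgan_1 = complement(A | B) == (complement(A) & complement(B))
de_morgan_2 = complement(A & B) == (complement(A) | complement(B))

print(pieces, de_morgan_1, de_morgan_2)
```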


Figure 1.2.2: Venn diagram depicting the subsets A, B, $A \cap B$, $A \cap B^c$, $A^c \cap B$, $A^c \cap B^c$,
and $A \cup B$.


Finally, we note that if A and B are disjoint subsets, then it makes sense to depict 
these as drawn in Figure 1.2.3, i.e., as two nonoverlapping circles because they have 
no elements in common. 


Figure 1.2.3: Venn diagram of the disjoint subsets A and B. 




Summary of Section 1.2 


• A probability model consists of a sample space S and a probability measure P
assigning probabilities to each event.


• Different sorts of sets can arise as sample spaces.


• Venn diagrams provide a convenient method for representing sets and the relationships
among them.


EXERCISES 


1.2.1 Suppose S = {1, 2,3}, with P({1}) = 1/2, P({2}) = 1/3, and P({3}) = 1/6. 
(a) What is P({1, 2})? 

(b) What is P({1, 2, 3})? 

(c) List all events A such that P(A) = 1/2. 

1.2.2 Suppose S = {1, 2, 3, 4, 5, 6, 7, 8}, with P({s}) = 1/8 for $1 \le s \le 8$.

(a) What is P({1, 2})? 

(b) What is P({1, 2, 3})? 

(c) How many events A are there such that P(A) = 1/2? 

1.2.3 Suppose S = {1,2,3}, with P({1}) = 1/2 and P({1,2}) = 2/3. What must 
P({2}) be? 

1.2.4 Suppose S = {1, 2, 3}, and we try to define P by P({1, 2, 3}) = 1, P({1, 2}) =
0.7, P({1, 3}) = 0.5, P({2, 3}) = 0.7, P({1}) = 0.2, P({2}) = 0.5, P({3}) = 0.3. Is 
P a valid probability measure? Why or why not? 

1.2.5 Consider the uniform distribution on [0, 1]. Let $s \in [0, 1]$ be any outcome. What
is P({s})? Do you find this result surprising? 

1.2.6 Label the subregions in the Venn diagram in Figure 1.2.4 using the sets A, B, and 
C and their complements (just as we did in Figure 1.2.2). 

1.2.7 On a Venn diagram, depict the set of elements that are in subsets A or B but not
in both. Also write this as a subset involving unions and intersections of A, B, and 
their complements. 

1.2.8 Suppose S = {1,2,3}, and P({1,2}) = 1/3, and P({2,3}) = 2/3. Compute 
P({1}), P({2}), and P({3}). 

1.2.9 Suppose S = {1,2,3,4}, and P({1}) = 1/12, and P({1,2}) = 1/6, and 
P({1, 2, 3}) = 1/3. Compute P({1}), P({2}), P({3}), and P({4}). 

1.2.10 Suppose S = {1, 2, 3}, and P({1}) = P({3}) = 2 P({2}). Compute P({1}),
P({2}), and P({3}). 

1.2.11 Suppose S = {1,2,3}, and P({1}) = P({2}) + 1/6, and P({3}) = 2 P({2}). 
Compute P({1}), P({2}), and P ({3}). 

1.2.12 Suppose S = {1, 2, 3, 4}, and $P(\{1\}) - 1/8 = P(\{2\}) = 3\,P(\{3\}) = 4\,P(\{4\})$.
Compute P({1}), P({2}), P({3}), and P({4}). 


10 Section 1.3: Properties of Probability Models 


Figure 1.2.4: Venn diagram of subsets A, B, and C. 


PROBLEMS 


1.2.13 Consider again the uniform distribution on [0, 1]. Is it true that 


\[
P([0, 1]) = \sum_{s \in [0, 1]} P(\{s\})\,?
\]


How does this relate to the additivity property of probability measures? 

1.2.14 Suppose S is a finite or countable set. Is it possible that P({s}) = 0 for every
single $s \in S$? Why or why not?

1.2.15 Suppose S is an uncountable set. Is it possible that P({s}) = 0 for every single
$s \in S$? Why or why not?


DISCUSSION TOPICS 


1.2.16 Does the additivity property make sense intuitively? Why or why not? 


1.2.17 Is it important that we always have P(S) = 1? How would probability theory 
change if this were not the case? 


1.3 | Properties of Probability Models 


The additivity property of probability measures automatically implies certain basic 
properties. These are true for any probability model at all. 

If A is any event, we write $A^c$ (read “A complement”) for the event that A does not
occur. In the weather example, if A = {rain}, then $A^c = \{\text{snow}, \text{clear}\}$. In the coin
examples, if A is the event that the first coin is heads, then $A^c$ is the event that the first
coin is tails.

Now, A and $A^c$ are always disjoint. Furthermore, their union is always the entire
sample space: $A \cup A^c = S$. Hence, by the additivity property, we must have
$P(A) + P(A^c) = P(S)$. But we always have P(S) = 1. Thus, $P(A) + P(A^c) = 1$, or

\[
P(A^c) = 1 - P(A). \tag{1.3.1}
\]


In words, the probability that any event does not occur is equal to one minus the prob- 
ability that it does occur. This is a very helpful fact that we shall use often. 

Now suppose that $A_1, A_2, \ldots$ are events that form a partition of the sample space
S. This means that $A_1, A_2, \ldots$ are disjoint and, furthermore, that their union is equal
to S, i.e., $A_1 \cup A_2 \cup \cdots = S$. We have the following basic theorem that allows us to
decompose the calculation of the probability of an event B into the sum of the probabilities of
the sets $A_i \cap B$. Often these are easier to compute.


Theorem 1.3.1 (Law of total probability, unconditioned version) Let $A_1, A_2, \ldots$
be events that form a partition of the sample space S. Let B be any event. Then

\[
P(B) = P(A_1 \cap B) + P(A_2 \cap B) + \cdots.
\]

PROOF The events $(A_1 \cap B), (A_2 \cap B), \ldots$ are disjoint, and their union is B. Hence,
the result follows immediately from the additivity property (1.2.1). ■


A somewhat more useful version of the law of total probability, and applications of its 
use, are provided in Section 1.5. 
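
The decomposition in Theorem 1.3.1 can be seen on a small example. The sketch below is an illustration under assumptions not taken from the text: it uses one roll of a fair die with equally likely outcomes, a hypothetical three-set partition, and the event B that the roll is even.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (illustrative choice)

def prob(event):
    """Uniform probability on S: each outcome has probability 1/6."""
    return Fraction(len(event & S), len(S))

partition = [{1, 2}, {3, 4}, {5, 6}]  # disjoint events whose union is S
B = {2, 4, 6}                          # the roll is even

# Sum of P(A_i ∩ B) over the partition, as in the theorem.
total = sum((prob(A_i & B) for A_i in partition), Fraction(0))

print(total, prob(B))
```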

Suppose now that A and B are two events such that A contains B (in symbols,
$A \supseteq B$). In words, all outcomes in B are also in A. Intuitively, A is a “larger” event
than B, so we would expect its probability to be at least as large. We have the following result.


Theorem 1.3.2 Let A and B be two events with $A \supseteq B$. Then

\[
P(A) = P(B) + P(A \cap B^c). \tag{1.3.2}
\]

PROOF We can write $A = B \cup (A \cap B^c)$, where B and $A \cap B^c$ are disjoint. Hence,
$P(A) = P(B) + P(A \cap B^c)$ by additivity. ■


Because we always have $P(A \cap B^c) \ge 0$, we conclude the following.


Corollary 1.3.1 (Monotonicity) Let A and B be two events, with $A \supseteq B$. Then

\[
P(A) \ge P(B).
\]

On the other hand, rearranging (1.3.2), we obtain the following.


Corollary 1.3.2 Let A and B be two events, with $A \supseteq B$. Then

\[
P(A \cap B^c) = P(A) - P(B).
\]

More generally, even if we do not have $A \supseteq B$, we have the following property.




Theorem 1.3.3 (Principle of inclusion–exclusion, two-event version) Let A and B
be two events. Then

\[
P(A \cup B) = P(A) + P(B) - P(A \cap B). \tag{1.3.4}
\]

PROOF We can write $A \cup B = (A \cap B^c) \cup (B \cap A^c) \cup (A \cap B)$, where $A \cap B^c$,
$B \cap A^c$, and $A \cap B$ are disjoint. By additivity, we have

\[
P(A \cup B) = P(A \cap B^c) + P(B \cap A^c) + P(A \cap B). \tag{1.3.5}
\]

On the other hand, using Corollary 1.3.2 (with B replaced by $A \cap B$), we have

\[
P(A \cap B^c) = P(A \cap (A \cap B)^c) = P(A) - P(A \cap B) \tag{1.3.6}
\]

and similarly,

\[
P(B \cap A^c) = P(B) - P(A \cap B). \tag{1.3.7}
\]

Substituting (1.3.6) and (1.3.7) into (1.3.5), the result follows. ■
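
A small numerical check of (1.3.4) may help fix ideas. The sketch below assumes a fair six-sided die with equally likely outcomes and two overlapping events chosen purely for illustration; it verifies the identity for this one case and also shows that simply adding P(A) and P(B) would overcount.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (illustrative choice)

def prob(event):
    """Uniform probability on S."""
    return Fraction(len(event), len(S))

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is at least 4

lhs = prob(A | B)                      # P(A ∪ B) computed directly
rhs = prob(A) + prob(B) - prob(A & B)  # right-hand side of (1.3.4)
naive_sum = prob(A) + prob(B)          # counts the overlap A ∩ B twice

print(lhs, rhs, naive_sum)
```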


A more general version of the principle of inclusion—exclusion is developed in Chal- 
lenge 1.3.10. 

Sometimes we do not need to evaluate the probability content of a union; we need 
only know it is bounded above by the sum of the probabilities of the individual events. 
This is called subadditivity. 


Theorem 1.3.4 (Subadditivity) Let $A_1, A_2, \ldots$ be a finite or countably infinite
sequence of events, not necessarily disjoint. Then

\[
P(A_1 \cup A_2 \cup \cdots) \le P(A_1) + P(A_2) + \cdots.
\]

PROOF See Section 1.7 for the proof of this result. ■


We note that some properties in the definition of a probability model actually follow
from other properties. For example, once we know the probability P is additive and
that P(S) = 1, it follows that we must have P(Ø) = 0. Indeed, because S and Ø are
disjoint, P(S ∪ Ø) = P(S) + P(Ø). But of course, P(S ∪ Ø) = P(S) = 1, so we
must have P(Ø) = 0.

Similarly, once we know P is additive on countably infinite sequences of disjoint
events, it follows that P must be additive on finite sequences of disjoint events, too.
Indeed, given a finite disjoint sequence $A_1, \ldots, A_n$, we can just set $A_i = \emptyset$ for all
$i > n$, to get a countably infinite disjoint sequence with the same union and the same
sum of probabilities.


Summary of Section 1.3 


• The probability of the complement of an event equals one minus the probability
of the event.


• Probabilities always satisfy the basic properties of total probability, subadditivity,
and monotonicity.


• The principle of inclusion–exclusion allows for the computation of $P(A \cup B)$ in
terms of simpler events.


EXERCISES 


1.3.1 Suppose S = {1,2,..., 100}. Suppose further that P({1}) = 0.1. 

(a) What is the probability P({2, 3,4, ..., 100})? 

(b) What is the smallest possible value of P({1, 2, 3})? 

1.3.2 Suppose that Al watches the six o’clock news 2/3 of the time, watches the eleven 
o’clock news 1/2 of the time, and watches both the six o’clock and eleven o’clock news 
1/3 of the time. For a randomly selected day, what is the probability that Al watches 
only the six o’clock news? For a randomly selected day, what is the probability that Al 
watches neither news? 

1.3.3 Suppose that an employee arrives late 10% of the time, leaves early 20% of the 
time, and both arrives late and leaves early 5% of the time. What is the probability that 
on a given day that employee will either arrive late or leave early (or both)? 

1.3.4 Suppose your right knee is sore 15% of the time, and your left knee is sore 10% 
of the time. What is the largest possible percentage of time that at least one of your 
knees is sore? What is the smallest possible percentage of time that at least one of your 
knees is sore? 

1.3.5 Suppose a fair coin is flipped five times in a row. 

(a) What is the probability of getting all five heads? 

(b) What is the probability of getting at least one tail? 

1.3.6 Suppose a card is chosen uniformly at random from a standard 52-card deck. 

(a) What is the probability that the card is a jack? 

(b) What is the probability that the card is a club? 

(c) What is the probability that the card is both a jack and a club? 

(d) What is the probability that the card is either a jack or a club (or both)? 

1.3.7 Suppose your team has a 40% chance of winning or tying today’s game and has 
a 30% chance of winning today’s game. What is the probability that today’s game will 
be a tie? 

1.3.8 Suppose 55% of students are female, of which 4/5 (44%) have long hair, and 45% 
are male, of which 1/3 (15% of all students) have long hair. What is the probability 
that a student chosen at random will either be female or have long hair (or both)? 


PROBLEMS 


1.3.9 Suppose we choose a positive integer at random, according to some unknown 
probability distribution. Suppose we know that P({1, 2,3, 4, 5}) = 0.3, that P({4, 5, 6}) 
= 0.4, and that P({1}) = 0.1. What are the largest and smallest possible values of 


P({2})? 
CHALLENGES 


14 Section 1.4: Uniform Probability on Finite Spaces 


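1.3.10 Generalize the principle of inclusion–exclusion, as follows.
(a) Suppose there are three events A, B, and C. Prove that

\[
\begin{aligned}
P(A \cup B \cup C) ={}& P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) \\
& - P(B \cap C) + P(A \cap B \cap C).
\end{aligned}
\]

(b) Suppose there are n events $A_1, A_2, \ldots, A_n$. Prove that

\[
\begin{aligned}
P(A_1 \cup \cdots \cup A_n) ={}& \sum_{i=1}^{n} P(A_i) - \sum_{\substack{i,j=1 \\ i<j}}^{n} P(A_i \cap A_j) + \sum_{\substack{i,j,k=1 \\ i<j<k}}^{n} P(A_i \cap A_j \cap A_k) \\
& - \cdots \pm P(A_1 \cap \cdots \cap A_n).
\end{aligned}
\]

(Hint: Use induction.)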


DISCUSSION TOPICS 


1.3.11 Of the various theorems presented in this section, which ones do you think are 
the most important? Which ones do you think are the least important? Explain the 
reasons for your choices. 


1.4 | Uniform Probability on Finite Spaces 


If the sample space S is finite, then one possible probability measure on S is the uniform
probability measure, which assigns probability 1/|S| to each outcome. Here |S| is the
number of elements in the sample space S. By additivity, it then follows that for any
event A we have

\[
P(A) = \frac{|A|}{|S|}. \tag{1.4.1}
\]

EXAMPLE 1.4.1
Suppose we roll a six-sided die. The possible outcomes are S = {1, 2, 3, 4, 5, 6}, so
that |S| = 6. If the die is fair, then we believe each outcome is equally likely. We thus
set P({i}) = 1/6 for each $i \in S$ so that P({3}) = 1/6, P({4}) = 1/6, etc. It follows
from (1.4.1) that, for example, P({3, 4}) = 2/6 = 1/3, P({1, 5, 6}) = 3/6 = 1/2, etc.
This is a good model of rolling a fair six-sided die once. ■
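
Formula (1.4.1) translates almost directly into code. The following is a minimal sketch, assuming the sample space is supplied as a finite collection of distinct outcomes; the function name `uniform_prob` and its argument checks are illustrative choices.

```python
from fractions import Fraction

def uniform_prob(event, sample_space):
    """P(A) = |A| / |S| under the uniform probability measure on a finite S."""
    event, sample_space = set(event), set(sample_space)
    if not sample_space:
        raise ValueError("sample space must be nonempty")
    if not event <= sample_space:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

die = range(1, 7)  # outcomes of one roll of a six-sided die

print(uniform_prob({3, 4}, die))     # 1/3
print(uniform_prob({1, 5, 6}, die))  # 1/2
```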
EXAMPLE 1.4.2
For a second example, suppose we flip a fair coin once. Then S = {heads, tails}, so
that |S| = 2, and P({heads}) = P({tails}) = 1/2. ■

EXAMPLE 1.4.3
Suppose now that we flip three different fair coins. The outcome can be written as a
sequence of three letters, with each letter being H (for heads) or T (for tails). Thus,


S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.


Here |S| = 8, and each of the outcomes is equally likely. Hence, P({HHH}) = 1/8,
P({HHH, TTT}) = 2/8 = 1/4, etc. Note also that, by additivity, we have, for
example, that P(exactly two heads) = P({HHT, HTH, THH}) = 1/8 + 1/8 +
1/8 = 3/8, etc. ■




EXAMPLE 1.4.4 
For a final example, suppose we roll a fair six-sided die and flip a fair coin. Then we
can write


S = {1H, 2H, 3H, 4H, 5H, 6H, 1T, 2T, 3T, 4T, 5T, 6T}.


Hence, |S| = 12 in this case, and P(s) = 1/12 for each $s \in S$. ■


1.4.1 | Combinatorial Principles 


Because of (1.4.1), problems involving uniform distributions on finite sample spaces 
often come down to being able to compute the sizes |A| and |S| of the sets involved.
That is, we need to be good at counting the number of elements in various sets. The 
science of counting is called combinatorics, and some aspects of it are very sophisti- 
cated. In the remainder of this section, we consider a few simple combinatorial rules 
and their application in probability theory when the uniform distribution is appropriate. 


EXAMPLE 1.4.5 Counting Sequences: The Multiplication Principle 

Suppose we flip three fair coins and roll two fair six-sided dice. What is the probability
that all three coins come up heads and that both dice come up 6? Each coin
has two possible outcomes (heads and tails), and each die has six possible outcomes
{1, 2, 3, 4, 5, 6}. The total number of possible outcomes of the three coins and two dice
is thus given by multiplying three 2’s and two 6’s, i.e., 2 × 2 × 2 × 6 × 6 = 288. This is
sometimes referred to as the multiplication principle. There are thus 288 possible outcomes
of our experiment (e.g., HHH66, HTH24, TTH15, etc.). Of these outcomes,
only one (namely, HHH66) counts as a success. Thus, the probability that all three
coins come up heads and both dice come up 6 is equal to 1/288.

Notice that we can obtain this result in an alternative way. The chance that any
one of the coins comes up heads is 1/2, and the chance that any one die comes up 6 is
1/6. Furthermore, these events are all independent (see the next section). Under independence,
the probability that they all occur is given by the product of their individual
probabilities, namely,


(1/2)(1/2)(1/2)(1/6)(1/6) = 1/288.


More generally, suppose we have k finite sets $S_1, \ldots, S_k$ and we want to count the
number of sequences of length k where the ith element comes from $S_i$, i.e., count the
number of elements in

\[
S = \{(s_1, \ldots, s_k) : s_i \in S_i\} = S_1 \times \cdots \times S_k.
\]

The multiplication principle says that the number of such sequences is obtained by
multiplying together the number of elements in each set $S_i$, i.e.,

\[
|S| = |S_1| \cdots |S_k|.
\]
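
The multiplication principle can be checked by building the product set explicitly. The sketch below uses `itertools.product` from the Python standard library to list every outcome of three coins and two dice; it is an enumeration for illustration and would not scale to large product sets.

```python
from fractions import Fraction
from itertools import product

coin = "HT"        # two possible results per coin
die = range(1, 7)  # six possible results per die

# Every sequence (coin, coin, coin, die, die): the product set S_1 x ... x S_5.
outcomes = list(product(coin, coin, coin, die, die))

# The single outcome that counts as a success: HHH66.
successes = [o for o in outcomes if o == ("H", "H", "H", 6, 6)]
p_success = Fraction(len(successes), len(outcomes))

print(len(outcomes), p_success)  # 288 1/288
```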




EXAMPLE 1.4.6 

Suppose we roll two fair six-sided dice. What is the probability that the sum of the
numbers showing is equal to 10? By the above multiplication principle, the total
number of possible outcomes is equal to 6 × 6 = 36. Of these outcomes, there are
three that sum to 10, namely, (4, 6), (5, 5), and (6, 4). Thus, the probability that the
sum is 10 is equal to 3/36, or 1/12. ■
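
The same count can be reproduced by listing all 36 ordered pairs and filtering on the sum. A minimal sketch, assuming two distinguishable fair dice so that every ordered pair is equally likely; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die).
rolls = list(product(range(1, 7), repeat=2))

# Keep the pairs whose sum is 10.
sum_is_10 = [r for r in rolls if sum(r) == 10]
p_sum_10 = Fraction(len(sum_is_10), len(rolls))

print(sum_is_10, p_sum_10)
```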


EXAMPLE 1.4.7 Counting Permutations 

Suppose four friends go to a restaurant, and each checks his or her coat. At the end
of the meal, the four coats are randomly returned to the four people. What is the
probability that each of the four people gets his or her own coat? Here the total number
of different ways the coats can be returned is equal to 4 × 3 × 2 × 1, or 4! (i.e., four
factorial). This is because the first coat can be returned to any of the four friends,
the second coat to any of the three remaining friends, and so on. Only one of these
assignments is correct. Hence, the probability that each of the four people gets his or
her own coat is equal to 1/4!, or 1/24.

Here we are counting permutations, or sequences of elements from a set where
no element appears more than once. We can use the multiplication principle to count
permutations more generally. For example, suppose |S| = n and we want to count the
number of permutations of length $k \le n$ obtained from S, i.e., we want to count the
number of elements of the set

\[
\{(s_1, \ldots, s_k) : s_i \in S,\ s_i \ne s_j \text{ when } i \ne j\}.
\]

Then we have n choices for the first element $s_1$, $n - 1$ choices for the second element,
and finally $n - (k - 1) = n - k + 1$ choices for the last element. So there are
$n(n - 1) \cdots (n - k + 1)$ permutations of length k from a set of n elements. This can
also be written as $n!/(n - k)!$. Notice that when k = n, there are

\[
n! = n(n - 1) \cdots 2 \cdot 1
\]

permutations of length n. ■
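
The permutation count can be compared against direct enumeration. The sketch below uses `itertools.permutations` and `math.perm` from the Python standard library (the latter requires Python 3.8 or later). The sizes n = 5 and k = 3 are arbitrary illustrative choices, and the coat-check example labels the four friends 0 to 3.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial, perm

n, k = 5, 3  # illustrative sizes

# Enumerate the permutations of length k from a set of n elements.
count = sum(1 for _ in permutations(range(n), k))
formula = factorial(n) // factorial(n - k)  # n!/(n - k)!

# Coat check: position i holds the coat returned to friend i.
coat_orders = list(permutations(range(4)))
all_correct = [order for order in coat_orders if order == (0, 1, 2, 3)]
p_all_correct = Fraction(len(all_correct), len(coat_orders))

print(count, formula, perm(n, k), p_all_correct)
```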


EXAMPLE 1.4.8 Counting Subsets 

Suppose 10 fair coins are flipped. What is the probability that exactly seven of them
are heads? Here each possible sequence of 10 heads or tails (e.g., HHHTTTHTTT,
THTTTTHHHT, etc.) is equally likely, and by the multiplication principle the total
number of possible outcomes is equal to 2 multiplied by itself 10 times, or $2^{10} = 1024$.
Hence, the probability of any particular sequence occurring is 1/1024. But of these
sequences, how many have exactly seven heads?

To answer this, notice that we may specify such a sequence by giving the positions
of the seven heads, which involves choosing a subset of size 7 from the set of possible
indices {1, ..., 10}. There are $10!/3! = 10 \cdot 9 \cdots 5 \cdot 4$ different permutations of length
7 from {1, ..., 10}, and each such permutation specifies a sequence of seven heads
and three tails. But we can permute the indices specifying where the heads go in 7!
different ways without changing the sequence of heads and tails. So the total number
of outcomes with exactly seven heads is equal to $10!/(3!\,7!) = 120$. The probability that
exactly seven of the 10 coins are heads is therefore equal to 120/1024, or just under
12%.


In general, if we have a set $S$ of $n$ elements, then the number of different subsets of
size $k$ that we can construct by choosing elements from $S$ is

$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!},$$

which is called the binomial coefficient. This follows by the same argument, namely,
there are $n!/(n - k)!$ permutations of length $k$ obtained from the set; each such permutation,
and the $k!$ permutations obtained by permuting it, specify a unique subset of $S$. ■

It follows, for example, that the probability of obtaining exactly $k$ heads when
flipping a total of $n$ fair coins is given by

$$\binom{n}{k}\, 2^{-n} = \frac{n!}{k!\,(n - k)!}\, 2^{-n}.$$

This is because there are $\binom{n}{k}$ different patterns of $k$ heads and $n - k$ tails, and a total of
$2^n$ different sequences of $n$ heads and tails.

More generally, if each coin has probability $\theta$ of being heads (and probability $1 - \theta$
of being tails), where $0 \le \theta \le 1$, then the probability of obtaining exactly $k$ heads
when flipping a total of $n$ such coins is given by

$$\binom{n}{k}\, \theta^k (1 - \theta)^{n-k} = \frac{n!}{k!\,(n - k)!}\, \theta^k (1 - \theta)^{n-k}, \qquad (1.4.2)$$

because each of the $\binom{n}{k}$ different patterns of $k$ heads and $n - k$ tails has probability
$\theta^k (1 - \theta)^{n-k}$ of occurring (this follows from the discussion of independence in Section
1.5.2). If $\theta = 1/2$, then this reduces to the previous formula.
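The formula (1.4.2) can be evaluated exactly with rational arithmetic. The sketch below is one possible way to do so; the function name `prob_k_heads` and its `theta` parameter are illustrative assumptions. It reproduces the 10-coin calculation of Example 1.4.8.

```python
from fractions import Fraction
from math import comb


def prob_k_heads(n: int, k: int, theta: Fraction = Fraction(1, 2)) -> Fraction:
    """Probability of exactly k heads in n flips, each heads with probability theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


# Example 1.4.8: 10 fair coins, exactly seven heads.
print(comb(10, 7))                  # 120 sequences with exactly seven heads
print(prob_k_heads(10, 7))          # 15/128, i.e., 120/1024
print(float(prob_k_heads(10, 7)))   # 0.1171875, just under 12%

# For any theta, the probabilities over k = 0, ..., n should add up to 1.
theta = Fraction(1, 3)
total = sum(prob_k_heads(10, k, theta) for k in range(11))
print(total)                        # 1
```

Using `Fraction` rather than floating point keeps every value exact, so the check that the probabilities sum to 1 holds without rounding error.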


EXAMPLE 1.4.9 Counting Sequences of Subsets and Partitions 
Suppose we have a set $S$ of $n$ elements and we want to count the number of elements
of

$$\{(S_1, S_2, \ldots, S_l) : S_i \subseteq S,\ |S_i| = k_i,\ S_i \cap S_j = \emptyset \text{ when } i \ne j\},$$

namely, we want to count the number of sequences of $l$ subsets of a set where no
two subsets have any elements in common and the $i$th subset has $k_i$ elements. By the
multiplication principle, this equals

$$\binom{n}{k_1} \binom{n - k_1}{k_2} \cdots \binom{n - k_1 - \cdots - k_{l-1}}{k_l} = \frac{n!}{k_1! \cdots k_{l-1}!\, k_l!\, (n - k_1 - \cdots - k_l)!}, \qquad (1.4.3)$$

because we can choose the elements of $S_1$ in $\binom{n}{k_1}$ ways, choose the elements of $S_2$ in
$\binom{n - k_1}{k_2}$ ways, etc.

When we have that $S = S_1 \cup S_2 \cup \cdots \cup S_l$, in addition to the individual sets being
mutually disjoint, then we are counting the number of ordered partitions of a set of $n$
elements with $k_1$ elements in the first set, $k_2$ elements in the second set, etc. In this
case, (1.4.3) equals

$$\binom{n}{k_1\; k_2\; \ldots\; k_l} = \frac{n!}{k_1!\, k_2! \cdots k_l!}, \qquad (1.4.4)$$

which is called the multinomial coefficient. ■

For example, how many different bridge hands are there? By this we mean how 
many different ways can a deck of 52 cards be divided up into four hands of 13 cards 
each, with the hands labelled North, East, South, and West, respectively. By (1.4.4), 
this equals 


$$\binom{52}{13\; 13\; 13\; 13} = \frac{52!}{13!\, 13!\, 13!\, 13!} \approx 5.364474 \times 10^{28},$$


which is a very large number. 
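The two sides of (1.4.3) and the bridge-hand count can be checked numerically. The following is a minimal sketch; the function names `multinomial` and `multinomial_sequential` are illustrative assumptions. The first uses the factorial form (1.4.4), and the second multiplies binomial coefficients one subset at a time, as in the multiplication-principle argument.

```python
from math import comb, factorial, prod


def multinomial(n: int, ks: list[int]) -> int:
    """Ordered partitions of an n-element set into blocks of sizes ks, as in (1.4.4)."""
    if sum(ks) != n:
        raise ValueError("block sizes must add up to n")
    return factorial(n) // prod(factorial(k) for k in ks)


def multinomial_sequential(n: int, ks: list[int]) -> int:
    """Same count, built by choosing the blocks one after another."""
    total, remaining = 1, n
    for k in ks:
        total *= comb(remaining, k)  # ways to choose the next block
        remaining -= k               # elements still unassigned
    return total


hands = multinomial(52, [13, 13, 13, 13])
print(f"{hands:.4e}")  # roughly 5.36e+28 labelled bridge deals
print(hands == multinomial_sequential(52, [13, 13, 13, 13]))  # True
```

Python integers have arbitrary precision, so the count is computed exactly before being formatted for display.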


Summary of Section 1.4 


• The uniform probability distribution on a finite sample space $S$ satisfies $P(A) = |A|/|S|$.

• Computing $P(A)$ in this case requires computing the sizes of the sets $A$ and $S$.
This may require combinatorial principles such as the multiplication principle,
factorials, and binomial/multinomial coefficients.


EXERCISES 


1.4.1 Suppose we roll eight fair six-sided dice. 

(a) What is the probability that all eight dice show a 6? 

(b) What is the probability that all eight dice show the same number? 

(c) What is the probability that the sum of the eight dice is equal to 9? 

1.4.2 Suppose we roll 10 fair six-sided dice. What is the probability that there are 
exactly two 2’s showing? 

1.4.3 Suppose we flip 100 fair independent coins. What is the probability that at least 
three of them are heads? (Hint: You may wish to use (1.3.1).) 

1.4.4 Suppose we are dealt five cards from an ordinary 52-card deck. What is the 
probability that 

(a) we get all four aces, plus the king of spades? 

(b) all five cards are spades? 

(c) we get no pairs (i.e., all five cards are different values)? 

(d) we get a full house (i.e., three cards of a kind, plus a different pair)? 

1.4.5 Suppose we deal four 13-card bridge hands from an ordinary 52-card deck. What 
is the probability that 

(a) all 13 spades end up in the same hand? 

(b) all four aces end up in the same hand? 

1.4.6 Suppose we pick two cards at random from an ordinary 52-card deck. What 
is the probability that the sum of the values of the two cards (where we count jacks, 
queens, and kings as 10, and count aces as 1) is at least 4? 


1.4.7 Suppose we keep dealing cards from an ordinary 52-card deck until the first jack 
appears. What is the probability that at least 10 cards go by before the first jack? 

1.4.8 In a well-shuffled ordinary 52-card deck, what is the probability that the ace of 
spades and the ace of clubs are adjacent to each other? 

1.4.9 Suppose we repeatedly roll two fair six-sided dice, considering the sum of the 
two values showing each time. What is the probability that the first time the sum is 
exactly 7 is on the third roll? 

1.4.10 Suppose we roll three fair six-sided dice. What is the probability that two of 
them show the same value, but the third one does not? 

1.4.11 Consider two urns, labelled urn #1 and urn #2. Suppose urn #1 has 5 red and 
7 blue balls. Suppose urn #2 has 6 red and 12 blue balls. Suppose we pick three balls 
uniformly at random from each of the two urns. What is the probability that all six 
chosen balls are the same color? 

1.4.12 Suppose we roll a fair six-sided die and flip three fair coins. What is the proba- 
bility that the total number of heads is equal to the number showing on the die? 

1.4.13 Suppose we flip two pennies, three nickels, and four dimes. What is the proba- 
bility that the total value of all coins showing heads is equal to $0.31? 


PROBLEMS 


1.4.14 Show that a probability measure defined by (1.4.1) is always additive in the 
sense of (1.2.1). 

1.4.15 Suppose we roll eight fair six-sided dice. What is the probability that the sum 
of the eight dice is equal to 9? What is the probability that the sum of the eight dice is 
equal to 10? What is the probability that the sum of the eight dice is equal to 11? 
1.4.16 Suppose we roll one fair six-sided die, and flip six coins. What is the probability 
that the number of heads is equal to the number showing on the die? 

1.4.17 Suppose we roll 10 fair six-sided dice. What is the probability that there are 
exactly two 2’s showing and exactly three 3’s showing? 

1.4.18 Suppose we deal four 13-card bridge hands from an ordinary 52-card deck. 
What is the probability that the North and East hands each have exactly the same num- 
ber of spades? 

1.4.19 Suppose we pick a card at random from an ordinary 52-card deck and also flip 
10 fair coins. What is the probability that the number of heads equals the value of the 
card (where we count jacks, queens, and kings as 10, and count aces as 1)? 


CHALLENGES 


1.4.20 Suppose we roll two fair six-sided dice and flip 12 coins. What is the probability 
that the number of heads is equal to the sum of the numbers showing on the two dice? 
1.4.21 (The birthday problem) Suppose there are C people, each of whose birthdays 
(month and day only) are equally likely to fall on any of the 365 days of a normal (i.e., 
non-leap) year. 

(a) Suppose C = 2. What is the probability that the two people have the same exact 
birthday? 


(b) Suppose C > 2. What is the probability that all C people have the same exact 
birthday? 

(c) Suppose C > 2. What is the probability that some pair of the C people have the 
same exact birthday? (Hint: You may wish to use (1.3.1).) 

(d) What is the smallest value of C such that the probability in part (c) is more than 
0.5? Do you find this result surprising? 


1.5 | Conditional Probability and Independence 


Consider again the three-coin example as in Example 1.4.3, where we flip three differ- 
ent fair coins, and 


S ={HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}, 


with $P(s) = 1/8$ for each $s \in S$. What is the probability that the first coin comes
up heads? Well, of course, this should be 1/2. We can see this more formally by
saying that $P(\text{first coin heads}) = P(\{HHH, HHT, HTH, HTT\}) = 4/8 = 1/2$, as
it should.

But suppose now that an informant tells us that exactly two of the three coins came 
up heads. Now what is the probability that the first coin was heads? 

The point is that this informant has changed our available information, i.e., changed 
our level of ignorance. It follows that our corresponding probabilities should also 
change. Indeed, if we know that exactly two of the coins were heads, then we know 
that the outcome was one of HHT, HTH, and THH. Because those three outcomes
should (in this case) still all be equally likely, and because only the first two correspond 
to the first coin being heads, we conclude the following: If we know that exactly two 
of the three coins are heads, then the probability that the first coin is heads is 2/3. 

More precisely, we have computed a conditional probability. That is, we have de- 
termined that, conditional on knowing that exactly two coins came up heads, the con- 
ditional probability of the first coin being heads is 2/3. We write this in mathematical 
notation as 

P (first coin heads | two coins heads) = 2/3. 


Here the vertical bar | stands for “conditional on,” or “given that.” 


1.5.1 | Conditional Probability 


In general, given two events $A$ and $B$ with $P(B) > 0$, the conditional probability of
$A$ given $B$, written $P(A \mid B)$, stands for the fraction of the time that $A$ occurs once we
know that $B$ occurs. It is computed as the ratio of the probability that $A$ and $B$ both
occur, divided by the probability that $B$ occurs, as follows.


Definition 1.5.1 Given two events $A$ and $B$, with $P(B) > 0$, the conditional probability
of $A$ given $B$ is equal to

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \qquad (1.5.1)$$


The motivation for (1.5.1) is as follows. The event $B$ will occur a fraction $P(B)$ of
the time. Also, both $A$ and $B$ will occur a fraction $P(A \cap B)$ of the time. The ratio
$P(A \cap B)/P(B)$ thus gives the proportion of the times when $B$ occurs, that $A$ also
occurs. That is, if we ignore all the times that $B$ does not occur and consider only those
times that $B$ does occur, then the ratio $P(A \cap B)/P(B)$ equals the fraction of the time
that $A$ will also occur. This is precisely what is meant by the conditional probability of
$A$ given $B$.

In the example just computed, $A$ is the event that the first coin is heads, while $B$
is the event that exactly two coins were heads. Hence, in mathematical terms,
$A = \{HHH, HHT, HTH, HTT\}$ and $B = \{HHT, HTH, THH\}$. It follows that
$A \cap B = \{HHT, HTH\}$. Therefore,

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(\{HHT, HTH\})}{P(\{HHT, HTH, THH\})} = \frac{2/8}{3/8} = 2/3,$$


as already computed. 
On the other hand, we similarly compute that 


P (first coin tails | two coins heads) = 1/3. 


We thus see that conditioning on some event (such as “two coins heads”) can make 
probabilities either increase (as for the event “first coin heads”) or decrease (as for the 
event “first coin tails”). 
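The three-coin calculation can be reproduced by enumerating the sample space and applying (1.5.1) directly. This is a minimal sketch under the uniform model of the example; the helper names `prob` and `cond_prob` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Sample space for three fair coins: HHH, HHT, ..., TTT (each has probability 1/8).
S = ["".join(flips) for flips in product("HT", repeat=3)]


def prob(event: set) -> Fraction:
    """Uniform probability |event| / |S|."""
    return Fraction(len(event), len(S))


def cond_prob(A: set, B: set) -> Fraction:
    """P(A | B) = P(A and B) / P(B), assuming P(B) > 0."""
    return prob(A & B) / prob(B)


A = {s for s in S if s[0] == "H"}        # first coin heads
B = {s for s in S if s.count("H") == 2}  # exactly two heads

print(prob(A))                   # 1/2
print(cond_prob(A, B))           # 2/3
print(cond_prob(set(S) - A, B))  # 1/3 (first coin tails, given two heads)
```

The same two helpers work for any event on this sample space, so other conditional probabilities for three coins can be checked the same way.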

The definition of P(B | A) immediately leads to the multiplication formula 


$$P(A \cap B) = P(A)\,P(B \mid A). \qquad (1.5.2)$$


This allows us to compute the joint probability of A and B when we are given the 
probability of A and the conditional probability of B given A. 

Conditional probability allows us to express Theorem 1.3.1, the law of total proba- 
bility, in a different and sometimes more helpful way. 


Theorem 1.5.1 (Law of total probability, conditioned version) Let $A_1, A_2, \ldots$ be
events that form a partition of the sample space $S$, each of positive probability. Let
$B$ be any event. Then $P(B) = P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + \cdots$.

PROOF The multiplication formula (1.5.2) gives that $P(A_i \cap B) = P(A_i)P(B \mid A_i)$.
The result then follows immediately from Theorem 1.3.1. ■


EXAMPLE 1.5.1 

Suppose a class contains 60% girls and 40% boys. Suppose that 30% of the girls have 
long hair, and 20% of the boys have long hair. A student is chosen uniformly at random 
from the class. What is the probability that the chosen student will have long hair? 

To answer this, we let $A_1$ be the set of girls and $A_2$ be the set of boys. Then
$\{A_1, A_2\}$ is a partition of the class. We further let $B$ be the set of all students with long
hair.

We are interested in $P(B)$. We compute this by Theorem 1.5.1 as

$$P(B) = P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) = (0.6)(0.3) + (0.4)(0.2) = 0.26,$$

so there is a 26% chance that the randomly chosen student has long hair. ■


Suppose now that A and B are two events, each of positive probability. In some ap- 
plications, we are given the values of P(A), P(B), and P(B | A) and want to compute 
P(A|B). The following result establishes a simple relationship among these quanti- 
ties. 


Theorem 1.5.2 (Bayes’ theorem) Let $A$ and $B$ be two events, each of positive probability.
Then

$$P(A \mid B) = \frac{P(A)}{P(B)}\, P(B \mid A).$$

PROOF We compute that

$$\frac{P(A)}{P(B)}\, P(B \mid A) = \frac{P(A)}{P(B)} \cdot \frac{P(A \cap B)}{P(A)} = \frac{P(A \cap B)}{P(B)} = P(A \mid B).$$

This gives the result. ■

Standard applications of the multiplication formula, the law of total probabilities,
and Bayes’ theorem occur with two-stage systems. The response for such systems can
be thought of as occurring in two steps or stages. Typically, we are given the probabilities
for the first stage and the conditional probabilities for the second stage. The
multiplication formula is then used to calculate joint probabilities for what happens at
both stages; the law of total probability is used to compute the probabilities for what
happens at the second stage; and Bayes’ theorem is used to calculate the conditional
probabilities for the first stage, given what has occurred at the second stage. We illustrate
this by an example.


EXAMPLE 1.5.2
Suppose urn #1 has 3 red and 2 blue balls, and urn #2 has 4 red and 7 blue balls.
Suppose one of the two urns is selected with probability 1/2 each, and then one of the
balls within that urn is picked uniformly at random.

What is the probability that urn #2 is selected at the first stage (event $A$) and a blue
ball is selected at the second stage (event $B$)? The multiplication formula provides the
correct way to compute this probability as

$$P(A \cap B) = P(A)\,P(B \mid A) = \frac{1}{2} \cdot \frac{7}{11} = \frac{7}{22}.$$

Suppose instead we want to compute the probability that a blue ball is obtained.
Using the law of total probability (Theorem 1.5.1), we have that

$$P(B) = P(A)\,P(B \mid A) + P(A^c)\,P(B \mid A^c) = \frac{1}{2} \cdot \frac{7}{11} + \frac{1}{2} \cdot \frac{2}{5}.$$

Now suppose we are given the information that the ball picked is blue. Then, using
Bayes’ theorem, the conditional probability that we had selected urn #2 is given by

$$P(A \mid B) = \frac{P(A)}{P(B)}\, P(B \mid A) = \frac{1/2}{(1/2)(7/11) + (1/2)(2/5)} \cdot \frac{7}{11} = 35/57 = 0.614.$$

Note that, without the information that a blue ball occurred at the second stage, we 
have that 
P (urn #2 selected) = 1/2. 


We see that knowing the ball was blue significantly increases the probability that urn 
#2 was selected. ■
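The three steps used in Example 1.5.2 (multiplication formula, law of total probability, Bayes’ theorem) can be sketched in a few lines of Python. The dictionary names `prior`, `p_blue_given`, `joint`, and `posterior` are illustrative assumptions; the numbers are those of the example.

```python
from fractions import Fraction

# First stage: which urn is selected.
prior = {"urn1": Fraction(1, 2), "urn2": Fraction(1, 2)}

# Second stage: probability of drawing a blue ball, given the urn.
p_blue_given = {"urn1": Fraction(2, 5), "urn2": Fraction(7, 11)}

# Multiplication formula: P(urn and blue) = P(urn) * P(blue | urn).
joint = {urn: prior[urn] * p_blue_given[urn] for urn in prior}

# Law of total probability: P(blue) is the sum of the joint probabilities.
p_blue = sum(joint.values())

# Bayes' theorem: P(urn | blue) = P(urn and blue) / P(blue).
posterior = {urn: joint[urn] / p_blue for urn in prior}

print(joint["urn2"])             # 7/22
print(p_blue)                    # 57/110
print(posterior["urn2"])         # 35/57
print(float(posterior["urn2"]))  # about 0.614
```

The same pattern extends to any finite number of first-stage outcomes by adding entries to the two input dictionaries.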


We can represent a two-stage system using a tree, as in Figure 1.5.1. It can be help- 
ful to draw such a figure when carrying out probability computations for such systems. 
There are two possible outcomes at the first stage and three possible outcomes at the 
second stage. 


Figure 1.5.1: A tree depicting a two-stage system with two possible outcomes at the first stage
and three possible outcomes at the second stage.


1.5.2 | Independence of Events 
Consider now Example 1.4.4, where we roll one fair die and flip one fair coin, so that 
$$S = \{1H, 2H, 3H, 4H, 5H, 6H, 1T, 2T, 3T, 4T, 5T, 6T\}$$

and $P(\{s\}) = 1/12$ for each $s \in S$. Here the probability that the die comes up 5 is
equal to $P(\{5H, 5T\}) = 2/12 = 1/6$, as it should be.

But now, what is the probability that the die comes up 5, conditional on knowing
that the coin came up tails? Well, we can compute that probability as

$$P(\text{die} = 5 \mid \text{coin} = \text{tails}) = \frac{P(\text{die} = 5 \text{ and coin} = \text{tails})}{P(\text{coin} = \text{tails})} = \frac{P(\{5T\})}{P(\{1T, 2T, 3T, 4T, 5T, 6T\})} = \frac{1/12}{6/12} = 1/6.$$

This is the same as the unconditional probability, $P(\text{die} = 5)$. It seems that knowing
that the coin was tails had no effect whatsoever on the probability that the die came
up 5. This property is called independence. We say that the coin and the die are
independent in this example, to indicate that the occurrence of one does not have any
influence on the probability of the other occurring.

More formally, we make the following definition. 


Definition 1.5.2 Two events A and B are independent if 


$$P(A \cap B) = P(A)\,P(B).$$


Now, because $P(A \mid B) = P(A \cap B)/P(B)$, we see that $A$ and $B$ are independent
if and only if $P(A \mid B) = P(A)$ or $P(B \mid A) = P(B)$, provided that $P(A) > 0$ and
$P(B) > 0$. Definition 1.5.2 has the advantage that it remains valid even if $P(B) = 0$
or $P(A) = 0$, respectively. Intuitively, events $A$ and $B$ are independent if neither one
has any impact on the probability of the other.


EXAMPLE 1.5.3 
In Example 1.4.4, if $A$ is the event that the die was 5, and $B$ is the event that the coin
was tails, then $P(A) = P(\{5H, 5T\}) = 2/12 = 1/6$, and

$$P(B) = P(\{1T, 2T, 3T, 4T, 5T, 6T\}) = 6/12 = 1/2.$$

Also, $P(A \cap B) = P(\{5T\}) = 1/12$, which is indeed equal to $(1/6)(1/2)$. Hence, $A$
and $B$ are independent in this case. ■
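Definition 1.5.2 is easy to test by enumeration on a finite uniform sample space. The sketch below checks the die-and-coin example and, for contrast, a pair of events that are not independent. The helper names `prob` and `independent` and the events `C` and `D` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Sample space: (die value, coin face), 12 equally likely outcomes.
S = list(product(range(1, 7), "HT"))


def prob(event: set) -> Fraction:
    """Uniform probability |event| / |S|."""
    return Fraction(len(event), len(S))


def independent(A: set, B: set) -> bool:
    """Definition 1.5.2: P(A and B) == P(A) * P(B)."""
    return prob(A & B) == prob(A) * prob(B)


A = {s for s in S if s[0] == 5}    # die shows 5
B = {s for s in S if s[1] == "T"}  # coin shows tails
print(independent(A, B))           # True

# Two events about the same die, chosen here for contrast.
C = {s for s in S if s[0] % 2 == 0}  # die is even
D = {s for s in S if s[0] == 6}      # die shows 6
print(independent(C, D))             # False: P(C and D) = 1/6, but P(C)P(D) = 1/12
```

Because the test uses the product form, it also works when one of the events has probability 0.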


For multiple events, the definition of independence is somewhat more involved. 


Definition 1.5.3 A collection of events $A_1, A_2, A_3, \ldots$ are independent if

$$P(A_{i_1} \cap \cdots \cap A_{i_j}) = P(A_{i_1}) \cdots P(A_{i_j})$$

for any finite subcollection $A_{i_1}, \ldots, A_{i_j}$ of distinct events.


EXAMPLE 1.5.4 
According to Definition 1.5.3, three events $A$, $B$, and $C$ are independent if all of the
following equations hold:

$$P(A \cap B) = P(A)P(B), \quad P(A \cap C) = P(A)P(C), \quad P(B \cap C) = P(B)P(C), \qquad (1.5.3)$$

and

$$P(A \cap B \cap C) = P(A)P(B)P(C). \qquad (1.5.4)$$

It is not sufficient to check just some of these conditions to verify independence. For
example, suppose that $S = \{1, 2, 3, 4\}$, with $P(\{1\}) = P(\{2\}) = P(\{3\}) = P(\{4\}) = 1/4$.
Let $A = \{1, 2\}$, $B = \{1, 3\}$, and $C = \{1, 4\}$. Then each of the three equations
(1.5.3) holds, but equation (1.5.4) does not hold. Here, the events $A$, $B$, and $C$ are
called pairwise independent, but they are not independent. ■
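The counterexample can be verified mechanically by checking every subcollection required by Definition 1.5.3. This is a minimal sketch for a finite list of events on a uniform sample space; the function name `mutually_independent` is an illustrative assumption.

```python
from fractions import Fraction
from itertools import combinations
from math import prod

S = {1, 2, 3, 4}  # four equally likely outcomes


def prob(event: set) -> Fraction:
    """Uniform probability |event| / |S|."""
    return Fraction(len(event), len(S))


def mutually_independent(events: list) -> bool:
    """Check the product rule for every subcollection of two or more events."""
    for size in range(2, len(events) + 1):
        for sub in combinations(events, size):
            intersection = set.intersection(*sub)
            if prob(intersection) != prod(prob(e) for e in sub):
                return False
    return True


A, B, C = {1, 2}, {1, 3}, {1, 4}

pairwise = all(prob(X & Y) == prob(X) * prob(Y)
               for X, Y in combinations([A, B, C], 2))
print(pairwise)                         # True: the equations (1.5.3) all hold
print(prob(A & B & C))                  # 1/4
print(prob(A) * prob(B) * prob(C))      # 1/8, so (1.5.4) fails
print(mutually_independent([A, B, C]))  # False
```

The number of subcollections grows exponentially with the number of events, so this exhaustive check is only practical for small collections.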


Summary of Section 1.5 


• Conditional probability measures the probability that $A$ occurs given that $B$ occurs;
it is given by $P(A \mid B) = P(A \cap B)/P(B)$.

• Conditional probability satisfies its own law of total probability.

• Events are independent if they have no effect on each other’s probabilities. Formally,
this means that $P(A \cap B) = P(A)P(B)$.

• If $A$ and $B$ are independent, and $P(A) > 0$ and $P(B) > 0$, then $P(A \mid B) = P(A)$
and $P(B \mid A) = P(B)$.


EXERCISES 


1.5.1 Suppose that we roll four fair six-sided dice. 

(a) What is the conditional probability that the first die shows 2, conditional on the 
event that exactly three dice show 2? 

(b) What is the conditional probability that the first die shows 2, conditional on the 
event that at least three dice show 2? 


1.5.2 Suppose we flip two fair coins and roll one fair six-sided die. 

(a) What is the probability that the number of heads equals the number showing on the 
die? 

(b) What is the conditional probability that the number of heads equals the number 
showing on the die, conditional on knowing that the die showed 1? 

(c) Is the answer for part (b) larger or smaller than the answer for part (a)? Explain 
intuitively why this is so. 

1.5.3 Suppose we flip three fair coins. 

(a) What is the probability that all three coins are heads? 

(b) What is the conditional probability that all three coins are heads, conditional on 
knowing that the number of heads is odd? 

(c) What is the conditional probability that all three coins are heads, given that the 
number of heads is even? 

1.5.4 Suppose we deal five cards from an ordinary 52-card deck. What is the con- 
ditional probability that all five cards are spades, given that at least four of them are 
spades? 

1.5.5 Suppose we deal five cards from an ordinary 52-card deck. What is the condi- 
tional probability that the hand contains all four aces, given that the hand contains at 
least four aces? 

1.5.6 Suppose we deal five cards from an ordinary 52-card deck. What is the condi- 
tional probability that the hand contains no pairs, given that it contains no spades? 
1.5.7 Suppose a baseball pitcher throws fastballs 80% of the time and curveballs 20% 
of the time. Suppose a batter hits a home run on 8% of all fastball pitches, and on 5% 
of all curveball pitches. What is the probability that this batter will hit a home run on 
this pitcher’s next pitch? 

1.5.8 Suppose the probability of snow is 20%, and the probability of a traffic accident 
is 10%. Suppose further that the conditional probability of an accident, given that it
snows, is 40%. What is the conditional probability that it snows, given that there is an
accident? 

1.5.9 Suppose we roll two fair six-sided dice, one red and one blue. Let A be the event 
that the two dice show the same value. Let B be the event that the sum of the two dice 
is equal to 12. Let C be the event that the red die shows 4. Let D be the event that the 
blue die shows 4. 

(a) Are A and B independent? 

(b) Are A and C independent? 

(c) Are A and D independent? 

(d) Are C and D independent? 

(e) Are A, C, and D all independent? 

1.5.10 Consider two urns, labelled urn #1 and urn #2. Suppose, as in Exercise 1.4.11, 
that urn #1 has 5 red and 7 blue balls, that urn #2 has 6 red and 12 blue balls, and that 
we pick three balls uniformly at random from each of the two urns. Conditional on the 
fact that all six chosen balls are the same color, what is the conditional probability that 
this color is red? 

1.5.11 Suppose we roll a fair six-sided die and then flip a number of fair coins equal to 
the number showing on the die. (For example, if the die shows 4, then we flip 4 coins.) 
(a) What is the probability that the number of heads equals 3? 

(b) Conditional on knowing that the number of heads equals 3, what is the conditional 
probability that the die showed the number 5? 

1.5.12 Suppose we roll a fair six-sided die and then pick a number of cards from a 
well-shuffled deck equal to the number showing on the die. (For example, if the die 
shows 4, then we pick 4 cards.) 

(a) What is the probability that the number of jacks in our hand equals 2? 

(b) Conditional on knowing that the number of jacks in our hand equals 2, what is the 
conditional probability that the die showed the number 3? 


PROBLEMS 


1.5.13 Consider three cards, as follows: One is red on both sides, one is black on both 
sides, and one is red on one side and black on the other. Suppose the cards are placed 
in a hat, and one is chosen at random. Suppose further that this card is placed flat on 
the table, so we can see one side only. 

(a) What is the probability that this one side is red? 

(b) Conditional on this one side being red, what is the probability that the card showing 
is the one that is red on both sides? (Hint: The answer is somewhat surprising.) 

(c) Suppose you wanted to verify the answer in part (b), using an actual, physical 
experiment. Explain how you could do this. 

1.5.14 Prove that $A$ and $B$ are independent if and only if $A^c$ and $B$ are independent.


1.5.15 Let A and B be events of positive probability. Prove that P(A | B) > P(A) if 
and only if P(B | A) > P(B). 


CHALLENGES 


1.5.16 Suppose we roll three fair six-sided dice. Compute the conditional probability 
that the first die shows 4, given that the sum of the three numbers showing is 12. 
1.5.17 (The game of craps) The game of craps is played by rolling two fair, six-sided 
dice. On the first roll, if the sum of the two numbers showing equals 2, 3, or 12, then 
the player immediately loses. If the sum equals 7 or 11, then the player immediately 
wins. If the sum equals any other value, then this value becomes the player’s “point.” 
The player then repeatedly rolls the two dice, until such time as he or she either rolls 
the point value again (in which case he or she wins) or rolls a 7 (in which case he or 
she loses). 

(a) Suppose the player’s point is equal to 4. Conditional on this, what is the conditional 
probability that he or she will win (i.e., will roll another 4 before rolling a 7)? (Hint: 
The final roll will be either a 4 or 7; what is the conditional probability that it is a 4?) 
(b) For $2 \le i \le 12$, let $p_i$ be the conditional probability that the player will win,
conditional on having rolled $i$ on the first roll. Compute $p_i$ for all $i$ with $2 \le i \le 12$.
(Hint: You’ve already done this for $i = 4$ in part (a). Also, the cases $i = 2, 3, 7, 11, 12$
are trivial. The other cases are similar to the $i = 4$ case.)

(c) Compute the overall probability that a player will win at craps. (Hint: Use part (b) 
and Theorem 1.5.1.) 

1.5.18 (The Monty Hall problem) Suppose there are three doors, labeled A, B, and C. 
A new car is behind one of the three doors, but you don’t know which. You select one 
of the doors, say, door A. The host then opens one of doors B or C, as follows: If the 
car is behind B, then they open C; if the car is behind C, then they open B; if the car 
is behind A, then they open either B or C with probability 1/2 each. (In any case, the 
door opened by the host will not have the car behind it.) The host then gives you the 
option of either sticking with your original door choice (i.e., A), or switching to the
remaining unopened door (i.e., whichever of B or C the host did not open). You then 
win (i.e., get to keep the car) if and only if the car is behind your final door selection. 
(Source: Parade Magazine, “Ask Marilyn” column, September 9, 1990.) Suppose for 
definiteness that the host opens door B. 

(a) If you stick with your original choice (i.e., door A), conditional on the host having 
opened door B, then what is your probability of winning? (Hint: First condition on the 
true location of the car. Then use Theorem 1.5.2.) 

(b) If you switch to the remaining door (i.e., door C), conditional on the host having 
opened door B, then what is your probability of winning? 

(c) Do you find the result of parts (a) and (b) surprising? How could you design a 
physical experiment to verify the result? 

(d) Suppose we change the rules so that, if you originally chose A and the car was in- 
deed behind A, then the host always opens door B. How would the answers to parts (a) 
and (b) change in this case? 

(e) Suppose we change the rules so that, if you originally chose A, then the host al- 
ways opens door B no matter where the car is. We then condition on the fact that door 
B happened not to have a car behind it. How would the answers to parts (a) and (b) 
change in this case? 


DISCUSSION TOPICS 


1.5.19 Suppose two people each flip a fair coin simultaneously. Will the results of the 
two flips usually be independent? Under what sorts of circumstances might they not be 
independent? (List as many such circumstances as you can.) 

1.5.20 Suppose you are able to repeat an experiment many times, and you wish to 
check whether or not two events are independent. How might you go about this? 
1.5.21 The Monty Hall problem (Challenge 1.5.18) was originally presented by Marilyn
vos Savant, writing in the “Ask Marilyn” column of Parade Magazine. She gave
the correct answer. However, many people (including some well-known mathematicians,
plus many laypeople) wrote in to complain that her answer was incorrect. The
controversy dragged on for months, with many letters and very strong language written
by both sides (in the end, vos Savant was vindicated). Part of the confusion lay in the
assumptions being made, e.g., some people misinterpreted her question as that of the 
modified version of part (e) of Challenge 1.5.18. However, a lot of the confusion was 
simply due to mathematical errors and misunderstandings. (Source: Parade Magazine, 
“Ask Marilyn” column, September 9, 1990; December 2, 1990; February 17, 1991; 
July 7, 1991.) 

(a) Does it surprise you that so many people, including well-known mathematicians, 
made errors in solving this problem? Why or why not? 

(b) Does it surprise you that so many people, including many laypeople, cared so 
strongly about the answer to this problem? Why or why not? 


1.6 | Continuity of P 


Suppose $A_1, A_2, \ldots$ is a sequence of events that are getting “closer” (in some sense) to
another event, $A$. Then we might expect that the probabilities $P(A_1), P(A_2), \ldots$ are
getting close to $P(A)$, i.e., that $\lim_{n \to \infty} P(A_n) = P(A)$. But can we be sure about
this?

Properties like this, which say that $P(A_n)$ is close to $P(A)$ whenever $A_n$ is “close”
to $A$, are called continuity properties. The above question can thus be translated,
roughly, as asking whether or not probability measures $P$ are “continuous.” It turns
out that $P$ is indeed continuous in some sense.

Specifically, let us write $\{A_n\} \nearrow A$ and say that the sequence $\{A_n\}$ increases to $A$,
if $A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots$, and also $\bigcup_{n=1}^{\infty} A_n = A$. That is, the sequence of events
is an increasing sequence, and furthermore its union is equal to $A$. For example, if
$A_n = (1/n, n]$, then $A_1 \subseteq A_2 \subseteq \cdots$ and $\bigcup_{n=1}^{\infty} A_n = (0, \infty)$. Hence,
$\{(1/n, n]\} \nearrow (0, \infty)$. Figure 1.6.1 depicts an increasing sequence of subsets.

Similarly, let us write $\{A_n\} \searrow A$ and say that the sequence $\{A_n\}$ decreases to
$A$, if $A_1 \supseteq A_2 \supseteq A_3 \supseteq \cdots$, and also $\bigcap_{n=1}^{\infty} A_n = A$. That is, the sequence of
events is a decreasing sequence, and furthermore its intersection is equal to $A$. For
example, if $A_n = (-1/n, 1/n]$, then $A_1 \supseteq A_2 \supseteq \cdots$ and $\bigcap_{n=1}^{\infty} A_n = \{0\}$. Hence,
$\{(-1/n, 1/n]\} \searrow \{0\}$. Figure 1.6.2 depicts a decreasing sequence of subsets.


Figure 1.6.1: An increasing sequence of subsets $A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots$


Figure 1.6.2: A decreasing sequence of subsets $A_1 \supseteq A_2 \supseteq A_3 \supseteq \cdots$


We will consider such sequences of sets at several points in the text. For this we 
need the following result. 


Theorem 1.6.1 Let $A, A_1, A_2, \ldots$ be events, and suppose that either $\{A_n\} \nearrow A$ or
$\{A_n\} \searrow A$. Then

$$\lim_{n \to \infty} P(A_n) = P(A).$$


PROOF See Section 1.7 for the proof of this theorem. ■


EXAMPLE 1.6.1 
Suppose $S$ is the set of all positive integers, with $P(s) = 2^{-s}$ for all $s \in S$. Then what
is $P(\{5, 6, 7, 8, \ldots\})$?

We begin by noting that the events $A_n = \{5, 6, 7, 8, \ldots, n\}$ increase to
$A = \{5, 6, 7, 8, \ldots\}$, i.e., $\{A_n\} \nearrow A$. Hence, using continuity of probabilities, we must
have

$$\begin{aligned}
P(\{5, 6, 7, 8, \ldots\}) &= \lim_{n \to \infty} P(\{5, 6, 7, 8, \ldots, n\}) \\
&= \lim_{n \to \infty} \bigl(P(5) + P(6) + \cdots + P(n)\bigr) \\
&= \lim_{n \to \infty} \bigl(2^{-5} + 2^{-6} + \cdots + 2^{-n}\bigr) = \lim_{n \to \infty} \left(\frac{2^{-5} - 2^{-n-1}}{1 - 2^{-1}}\right) \\
&= \lim_{n \to \infty} \bigl(2^{-4} - 2^{-n}\bigr) = 2^{-4} = 1/16.
\end{aligned}$$

Alternatively, we could use countable additivity directly, to conclude that

$$P(\{5, 6, 7, 8, \ldots\}) = P(5) + P(6) + P(7) + \cdots,$$

which amounts to the same thing. ■
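The limit in Example 1.6.1 can be watched numerically. The sketch below computes $P(A_n)$ exactly for a few values of $n$ and compares it with the closed form $2^{-4} - 2^{-n}$ derived above; the function name `prob_A_n` and the chosen values of $n$ are illustrative assumptions.

```python
from fractions import Fraction


def prob_A_n(n: int) -> Fraction:
    """P({5, 6, ..., n}) when P({s}) = 2**(-s), by finite additivity."""
    return sum(Fraction(1, 2**s) for s in range(5, n + 1))


limit = Fraction(1, 16)

for n in (5, 10, 20, 40):
    p = prob_A_n(n)
    # The partial sum matches the closed form 2^(-4) - 2^(-n) exactly.
    assert p == limit - Fraction(1, 2**n)
    # Print n, P(A_n), and the remaining gap to the limit 1/16.
    print(n, float(p), float(limit - p))
```

The gap to $1/16$ is exactly $2^{-n}$, which makes the convergence $P(A_n) \to P(A)$ explicit.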
EXAMPLE 1.6.2
Let $P$ be some probability measure on the space $S = R^1$. Suppose

$$P((3, 5 + 1/n)) \ge \delta$$

for all $n$, where $\delta > 0$. Let $A_n = (3, 5 + 1/n)$. Then $\{A_n\} \searrow A$ where $A = (3, 5]$.
Hence, we must have $P(A) = P((3, 5]) \ge \delta$ as well.

Note, however, that we could still have $P((3, 5)) = 0$. For example, perhaps
$P(\{5\}) = \delta$, but $P((3, 5)) = 0$. ■


Summary of Section 1.6 


• If $\{A_n\} \nearrow A$ or $\{A_n\} \searrow A$, then $\lim_{n \to \infty} P(A_n) = P(A)$.


• This allows us to compute or bound various probabilities that otherwise could
not be understood.


EXERCISES 


1.6.1 Suppose that $S = \{1, 2, 3, \ldots\}$ is the set of all positive integers and that $P(\{s\}) = 2^{-s}$
for all $s \in S$. Compute $P(A)$ where $A = \{2, 4, 6, \ldots\}$ is the set of all even
positive integers. Do this in two ways: by using continuity of $P$ (together with finite
additivity) and by using countable additivity.


1.6.2 Consider the uniform distribution on [0, 1]. Compute (with proof) 


$$\lim_{n \to \infty} P([1/4, 1 - e^{-n}]).$$


1.6.3 Suppose that S = {1,2,3,...} is the set of all positive integers and that P is 
some probability measure on S. Prove that we must have 


$$\lim_{n \to \infty} P(\{1, 2, \ldots, n\}) = 1.$$


1.6.4 Suppose P([0, 7%) = 24e forall n = 1,2,3,... . What must P({0}) be? 




1.6.5 Suppose $P([0, 1]) = 1$, but $P([1/n, 1]) = 0$ for all $n = 1, 2, 3, \ldots$. What must
$P(\{0\})$ be?

1.6.6 Suppose $P([1/n, 1/2]) \le 1/3$ for all $n = 1, 2, 3, \ldots$.

(a) Must we have $P((0, 1/2]) \le 1/3$?

(b) Must we have $P([0, 1/2]) \le 1/3$?

1.6.7 Suppose $P([0, \infty)) = 1$. Prove that there is some $n$ such that $P([0, n]) > 0.9$.
1.6.8 Suppose P((0, 1/2]) = 1/3. Prove that there is some n such that P([1/n, 1/2]) > 
1/4. 

1.6.9 Suppose $P([0, 1/2]) = 1/3$. Must there be some $n$ such that $P([1/n, 1/2]) >
1/4$?


PROBLEMS 


1.6.10 Let P be some probability measure on sample space S = [0, 1]. 
(a) Prove that we must have $\lim_{n \to \infty} P((0, 1/n)) = 0$.
(b) Show by example that we might have $\lim_{n \to \infty} P([0, 1/n)) > 0$.


CHALLENGES 


1.6.11 Suppose we know that P is finitely additive, but we do not know that it is 
countably additive. In other words, we know that $P(A_1 \cup \cdots \cup A_n) = P(A_1) + \cdots +
P(A_n)$ for any finite collection of disjoint events $\{A_1, \ldots, A_n\}$, but we do not know
about $P(A_1 \cup A_2 \cup \cdots)$ for infinite collections of disjoint events. Suppose further
that we know that P is continuous in the sense of Theorem 1.6.1. Using this, give a 
proof that P must be countably additive. (In effect, you are proving that continuity of 
P is equivalent to countable additivity of P, at least once we know that P is finitely 
additive.) 


1.7 Further Proofs (Advanced)
Proof of Theorem 1.3.4 


We want to prove that whenever $A_1, A_2, \ldots$ is a finite or countably infinite sequence of
events, not necessarily disjoint, then $P(A_1 \cup A_2 \cup \cdots) \le P(A_1) + P(A_2) + \cdots$.
Let $B_1 = A_1$, and for $n \ge 2$, let $B_n = A_n \cap (A_1 \cup \cdots \cup A_{n-1})^c$. Then $B_1, B_2, \ldots$
are disjoint, $B_1 \cup B_2 \cup \cdots = A_1 \cup A_2 \cup \cdots$ and, by additivity,
$$P(A_1 \cup A_2 \cup \cdots) = P(B_1 \cup B_2 \cup \cdots) = P(B_1) + P(B_2) + \cdots. \tag{1.7.1}$$

Furthermore, $A_n \supseteq B_n$, so by monotonicity, we have $P(A_n) \ge P(B_n)$. It follows from
(1.7.1) that

$$P(A_1 \cup A_2 \cup \cdots) = P(B_1) + P(B_2) + \cdots \le P(A_1) + P(A_2) + \cdots$$

as claimed. ∎
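The construction of the disjoint sets $B_n$ in this proof can be illustrated on a small finite example. The sketch below is only an illustration under our own assumptions: a uniform probability on a hypothetical ten-point sample space, and helper names `disjointify` and `prob` that do not appear in the text.

```python
from fractions import Fraction

def disjointify(events):
    """Return B_1, B_2, ... where B_n is A_n with all earlier A_k removed."""
    seen = set()
    pieces = []
    for a in events:
        pieces.append(set(a) - seen)  # B_n = A_n minus (A_1 union ... union A_{n-1})
        seen |= set(a)
    return pieces

def prob(event, size):
    """Uniform probability of an event in a sample space with `size` outcomes."""
    return Fraction(len(event), size)

S_SIZE = 10  # hypothetical sample space {0, 1, ..., 9}
A = [{0, 1, 2, 3}, {2, 3, 4}, {4, 5, 0}, {9}]  # overlapping events
B = disjointify(A)

union = set().union(*A)
lhs = prob(union, S_SIZE)                   # P(A_1 union A_2 union ...)
sum_B = sum(prob(b, S_SIZE) for b in B)     # P(B_1) + P(B_2) + ...
sum_A = sum(prob(a, S_SIZE) for a in A)     # P(A_1) + P(A_2) + ...
print(lhs, sum_B, sum_A)
```

Here the probability of the union agrees with the sum over the disjoint pieces and is no larger than the sum over the original events, as the proof asserts.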




Proof of Theorem 1.6.1 
We want to prove that when $A, A_1, A_2, \ldots$ are events, and either $\{A_n\} \nearrow A$ or $\{A_n\} \searrow
A$, then $\lim_{n \to \infty} P(A_n) = P(A)$.
Suppose first that $\{A_n\} \nearrow A$. Then we can write
$$A = A_1 \cup (A_2 \cap A_1^c) \cup (A_3 \cap A_2^c) \cup \cdots$$
where the union is disjoint. Hence, by additivity,
$$P(A) = P(A_1) + P(A_2 \cap A_1^c) + P(A_3 \cap A_2^c) + \cdots.$$
Now, by definition, writing this infinite sum is the same thing as writing
$$P(A) = \lim_{n \to \infty} \left( P(A_1) + P(A_2 \cap A_1^c) + \cdots + P(A_n \cap A_{n-1}^c) \right). \tag{1.7.2}$$
However, again by additivity, we see that
$$P(A_1) + P(A_2 \cap A_1^c) + P(A_3 \cap A_2^c) + \cdots + P(A_n \cap A_{n-1}^c) = P(A_n).$$

Substituting this information into (1.7.2), we obtain $P(A) = \lim_{n \to \infty} P(A_n)$, which
was to be proved.

Suppose now that $\{A_n\} \searrow A$. Let $B_n = A_n^c$, and let $B = A^c$. Then
we see that $\{B_n\} \nearrow B$ (why?). Hence, by what we just proved, we must have $P(B) =
\lim_{n \to \infty} P(B_n)$. But then, using (1.3.1), we have

$$1 - P(A) = \lim_{n \to \infty} \{1 - P(A_n)\},$$

from which it follows that $P(A) = \lim_{n \to \infty} P(A_n)$. This completes the proof. ∎


Chapter 2 


Random Variables and 
Distributions 


CHAPTER OUTLINE 

Section 1 Random Variables 

Section 2 Distributions of Random Variables 
Section 3 Discrete Distributions 

Section 4 Continuous Distributions 

Section 5 Cumulative Distribution Functions 
Section 6 One-Dimensional Change of Variable
Section 7 Joint Distributions 

Section 8 Conditioning and Independence 
Section 9 Multidimensional Change of Variable 
Section 10 Simulating Probability Distributions 
Section 11 Further Proofs (Advanced) 


In Chapter 1, we discussed the probability model as the central object of study in the 
theory of probability. This required defining a probability measure P on a class of 
subsets of the sample space S. It turns out that there are simpler ways of presenting a 
particular probability assignment than this — ways that are much more convenient to 
work with than P. This chapter is concerned with the definitions of random variables, 
distribution functions, probability functions, density functions, and the development 
of the concepts necessary for carrying out calculations for a probability model using 
these entities. This chapter also discusses the concept of the conditional distribution of 
one random variable, given the values of others. Conditional distributions of random 
variables provide the framework for discussing what it means to say that variables are 
related, which is important in many applications of probability and statistics. 




2.1 Random Variables


The previous chapter explained how to construct probability models, including a
sample space S and a probability measure P. Once we have a probability model, we may
define random variables for that probability model.

Intuitively, a random variable assigns a numerical value to each possible outcome 
in the sample space. For example, if the sample space is {rain, snow, clear}, then we 
might define a random variable X such that X = 3 if it rains, X = 6 if it snows, and 
X = —2.7 if it is clear. 

More formally, we have the following definition. 


Definition 2.1.1 A random variable is a function from the sample space $S$ to the
set $R^1$ of all real numbers.


Figure 2.1.1 provides a graphical representation of a random variable $X$ taking a
response value $s \in S$ into a real number $X(s) \in R^1$.


Figure 2.1.1: A random variable $X$ as a function on the sample space $S$ and taking values in $R^1$.


EXAMPLE 2.1.1 A Very Simple Random Variable

The random variable described above could be written formally as $X : \{\text{rain, snow,
clear}\} \to R^1$ by X(rain) = 3, X(snow) = 6, and X(clear) = −2.7. We will return to
this example below. ∎
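Since a random variable is just a function on the sample space, it can be written down directly in code. The following minimal Python sketch represents the X of Example 2.1.1; the names `S`, `X_TABLE`, and `X` are our own choices, not notation from the text.

```python
# The sample space and the random variable X of Example 2.1.1.
S = {"rain", "snow", "clear"}
X_TABLE = {"rain": 3, "snow": 6, "clear": -2.7}

def X(s):
    """Random variable X: maps an outcome s in S to a real number."""
    return X_TABLE[s]

# Evaluate X at every outcome of the sample space.
for s in sorted(S):
    print(s, X(s))
```

Nothing here involves probabilities yet: the random variable is defined before any probability measure is used.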


We now present several further examples. The point is, we can define random
variables any way we like, as long as they are functions from the sample space to $R^1$.


EXAMPLE 2.1.2 

For the case S = {rain, snow, clear}, we might define a second random variable Y by
saying that Y = 0 if it rains, Y = −1/2 if it snows, and Y = 7/8 if it is clear. That is,
Y(rain) = 0, Y(snow) = −1/2, and Y(clear) = 7/8. ∎


EXAMPLE 2.1.3 

If the sample space corresponds to flipping three different coins, then we could let X 
be the total number of heads showing, let Y be the total number of tails showing, let 
Z = 0 if there is exactly one head, and otherwise Z = 17, etc. ∎


EXAMPLE 2.1.4 

If the sample space corresponds to rolling two fair dice, then we could let X be the
square of the number showing on the first die, let Y be the square of the number
showing on the second die, let Z be the sum of the two numbers showing, let W be the
square of the sum of the two numbers showing, let R be the sum of the squares of the
two numbers showing, etc. ∎


EXAMPLE 2.1.5 Constants as Random Variables 
As a special case, every constant value $c$ is also a random variable, by saying that
$c(s) = c$ for all $s \in S$. Thus, 5 is a random variable, as is 3 or −21.6. ∎


EXAMPLE 2.1.6 Indicator Functions 
One special kind of random variable is worth mentioning. If $A$ is any event, then we
can define the indicator function of $A$, written $I_A$, to be the random variable

$$I_A(s) = \begin{cases} 1 & s \in A \\ 0 & s \notin A, \end{cases}$$

which is equal to 1 on $A$, and is equal to 0 on $A^c$. ∎


Given random variables $X$ and $Y$, we can perform the usual arithmetic operations
on them. Thus, for example, $Z = X^2$ is another random variable, defined by $Z(s) =
X^2(s) = (X(s))^2 = X(s) \times X(s)$. Similarly, if $W = XY^3$, then $W(s) = X(s) \times
Y(s) \times Y(s) \times Y(s)$, etc. Also, if $Z = X + Y$, then $Z(s) = X(s) + Y(s)$, etc.


EXAMPLE 2.1.7 
Consider rolling a fair six-sided die, so that $S = \{1, 2, 3, 4, 5, 6\}$. Let $X$ be the number
showing, so that $X(s) = s$ for $s \in S$. Let $Y$ be three more than the number showing,
so that $Y(s) = s + 3$. Let $Z = X^2 + Y$. Then $Z(s) = X(s)^2 + Y(s) = s^2 + s + 3$. So
$Z(1) = 5$, $Z(2) = 9$, etc. ∎
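Arithmetic on random variables is carried out outcome by outcome, which the following Python sketch makes explicit for Example 2.1.7. The function names simply mirror the example, and the dictionary `z_values` is our own addition.

```python
S = range(1, 7)  # outcomes of one roll of a six-sided die

def X(s):
    """The number showing."""
    return s

def Y(s):
    """Three more than the number showing."""
    return s + 3

def Z(s):
    """Z = X^2 + Y, evaluated pointwise at the outcome s."""
    return X(s) ** 2 + Y(s)

# Tabulate Z(s) for every outcome s.
z_values = {s: Z(s) for s in S}
print(z_values)
```

The tabulated values agree with the closed form $s^2 + s + 3$ derived in the example.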

We write $X = Y$ to mean that $X(s) = Y(s)$ for all $s \in S$. Similarly, we write
$X \le Y$ to mean that $X(s) \le Y(s)$ for all $s \in S$, and $X \ge Y$ to mean that $X(s) \ge Y(s)$
for all $s \in S$. For example, we write $X \le c$ to mean that $X(s) \le c$ for all $s \in S$.


EXAMPLE 2.1.8 
Again consider rolling a fair six-sided die, with $S = \{1, 2, 3, 4, 5, 6\}$. For $s \in S$, let
$X(s) = s$, and let $Y = X + I_{\{6\}}$. This means that

$$Y(s) = X(s) + I_{\{6\}}(s) = \begin{cases} s & s \le 5 \\ 7 & s = 6. \end{cases}$$

Hence, $Y(s) = X(s)$ for $1 \le s \le 5$. But it is not true that $Y = X$, because $Y(6) \ne
X(6)$. On the other hand, it is true that $Y \ge X$. ∎
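Relations between random variables such as $Y = X$ or $Y \ge X$ are statements about every outcome at once. The sketch below checks them for Example 2.1.8 by looping over the sample space; the helper `indicator` and the variable names are our own assumptions.

```python
S = range(1, 7)  # outcomes of one roll of a six-sided die

def indicator(A):
    """Return the indicator function I_A of the event A as a Python function."""
    return lambda s: 1 if s in A else 0

def X(s):
    return s

I6 = indicator({6})

def Y(s):
    """Y = X + I_{6}, evaluated pointwise."""
    return X(s) + I6(s)

# Y = X requires equality at every outcome; Y >= X requires the inequality at every outcome.
equal_everywhere = all(Y(s) == X(s) for s in S)
dominates = all(Y(s) >= X(s) for s in S)
print(equal_everywhere, dominates)
```

A single outcome where the values differ (here $s = 6$) is enough to make $Y = X$ false, while $Y \ge X$ still holds.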


EXAMPLE 2.1.9 

For the random variable of Example 2.1.1 above, it is not true that $X \ge 0$, nor is it true
that $X \le 0$. However, it is true that $X \ge -2.7$ and that $X \le 6$. It is also true that
$X \ge -10$ and $X \le 100$. ∎


If S is infinite, then a random variable X can take on infinitely many different 
values. 


EXAMPLE 2.1.10 
If $S = \{1, 2, 3, \ldots\}$, with $P\{s\} = 2^{-s}$ for all $s \in S$, and if $X$ is defined by $X(s) = s^2$,
then we always have $X \ge 1$. But there is no largest value of $X(s)$ because the value
$X(s)$ increases without bound as $s \to \infty$. We shall call such a random variable an
unbounded random variable. ∎


Finally, suppose X is a random variable. We know that different states s occur with 
different probabilities. It follows that X(s) also takes different values with different 
probabilities. These probabilities are called the distribution of X; we consider them 
next. 


Summary of Section 2.1 


• A random variable is a function from the sample space to the set of real numbers.


• The function could be constant, or correspond to counting some random quantity
that arises, or any other sort of function.


EXERCISES 


2.1.1 Let $S = \{1, 2, 3, \ldots\}$, and let $X(s) = s^2$ and $Y(s) = 1/s$ for $s \in S$. For each
of the following quantities, determine (with explanation) whether or not it exists. If it
does exist, then give its value.

(a) $\min_{s \in S} X(s)$

(b) $\max_{s \in S} X(s)$

(c) $\min_{s \in S} Y(s)$

(d) $\max_{s \in S} Y(s)$

2.1.2 Let S = {high, middle, low}. Define random variables X, Y, and Z by X (high) = 
—12, X (middle) = —2, X (low) = 3, Y (high) = 0, Y(middle) = 0, Y (low) = 1, 
Z (high) = 6, Z (middle) = 0, Z (low) = 4. Determine whether each of the following 
relations is true or false. 

(a) $X < Y$

(b) $X \le Y$

(c) $Y < Z$

(d) $Y \le Z$

(e) $XY < Z$

(f) $XY \le Z$

2.1.3 Let S = {1, 2, 3, 4, 5}. 

(a) Define two different (i.e., nonequal) nonconstant random variables, X and Y, on S. 
(b) For the random variables $X$ and $Y$ that you have chosen, let $Z = X + Y^2$. Compute
$Z(s)$ for all $s \in S$.

2.1.4 Consider rolling a fair six-sided die, so that S = {1, 2,3, 4,5, 6}. Let X(s) =s, 
and Y(s) = s? +2. Let Z = XY. Compute Z(s) for alls € S. 

2.1.5 Let $A$ and $B$ be events, and let $X = I_A \cdot I_B$. Is $X$ an indicator function? If yes,
then of what event?

2.1.6 Let S = {1, 2,3, 4}, X = 11,2), Y = 10,3}, and Z = Ig,4. Let W =X+Y+4+Z. 
(a) Compute W(1). 

(b) Compute W (2). 




(c) Compute W (4). 

(d) Determine whether or not W > Z. 

2.1.7 Let S = {1, 2,3}, X = Im, Y = 123}, and Z = 1112). Let W =X -Y +Z. 
(a) Compute W(1). 

(b) Compute W (2). 

(c) Compute W (3). 

(d) Determine whether or not W > Z. 

2.1.8 Let S = {1,2, 3,4,5}, X = Jy12,3);, Y = J 3), and Z = Ig345). Let W = 
X-Y+Z. 

(a) Compute W(1). 

(b) Compute W (2). 

(c) Compute W (5). 

(d) Determine whether or not W > Z. 

2.1.9 Let S = {1,2,3,4}, X = In ,2}, and Y (s) = s? X (s). 

(a) Compute Y (1). 

(b) Compute Y (2). 

(c) Compute Y (4). 


PROBLEMS 


2.1.10 Let X be a random variable. 

(a) Is it necessarily true that X > 0? 

(b) Is it necessarily true that there is some real number c such that X + c > 0? 

(c) Suppose the sample space S is finite. Then is it necessarily true that there is some 
real number c such that X + c > 0? 

2.1.11 Suppose the sample space S is finite. Is it possible to define an unbounded 
random variable on S? Why or why not? 

2.1.12 Suppose X is a random variable that takes only the values 0 or 1. Must X be an 
indicator function? Explain. 

2.1.13 Suppose the sample space S is finite, of size m. How many different indicator 
functions can be defined on S? 

2.1.14 Suppose $X$ is a random variable. Let $Y = \sqrt{X}$. Must $Y$ be a random variable?
Explain. 


DISCUSSION TOPICS 


2.1.15 Mathematical probability theory was introduced to the English-speaking world 
largely by two American mathematicians, William Feller and Joe Doob, writing in the 
early 1950s. According to Professor Doob, the two of them had an argument about 
whether random variables should be called “random variables” or “chance variables.” 
They decided by flipping a coin — and “random variables” won. (Source: Statistical 
Science 12 (1997), No. 4, page 307.) Which name do you think would have been a 
better choice? 


2.2 Distributions of Random Variables


Because random variables are defined to be functions of the outcome s, and because 
the outcome s is assumed to be random (i.e., to take on different values with different 
probabilities), it follows that the value of a random variable will itself be random (as 
the name implies). 

Specifically, if X is a random variable, then what is the probability that X will equal 
some particular value x? Well, X = x precisely when the outcome s is chosen such 
that X(s) =x. 


EXAMPLE 2.2.1 

Let us again consider the random variable of Example 2.1.1, where S = {rain, snow, 
clear}, and X is defined by X (rain) = 3, X (snow) = 6, and X (clear) = —2.7. Suppose 
further that the probability measure P is such that P(rain) = 0.4, P(snow) = 0.15, 
and P(clear) = 0.45. Then clearly, X = 3 only when it rains, X = 6 only when 
it snows, and X = —2.7 only when it is clear. Thus, P(X = 3) = P(rain) = 0.4, 
P(X = 6) = P(snow) = 0.15, and P(X = —2.7) = P(clear) = 0.45. Also, 
$P(X = 17) = 0$, and in fact $P(X = x) = P(\emptyset) = 0$ for all $x \notin \{3, 6, -2.7\}$. We can
also compute that

$$P(X \in \{3, 6\}) = P(X = 3) + P(X = 6) = 0.4 + 0.15 = 0.55,$$
while
$$P(X < 5) = P(X = 3) + P(X = -2.7) = 0.4 + 0.45 = 0.85,$$
etc. ∎
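The computations in Example 2.2.1 all follow one recipe: collect the outcomes $s$ whose value $X(s)$ lies in the set of interest and add up their probabilities. A minimal Python sketch of that recipe follows. Representing a set $B$ by a predicate, and the function name `prob_X_in`, are our own assumptions; exact fractions are used so that 0.4, 0.15, and 0.45 add without rounding error.

```python
from fractions import Fraction

# Probability measure P and random variable X of Example 2.2.1.
P = {"rain": Fraction(2, 5), "snow": Fraction(3, 20), "clear": Fraction(9, 20)}
X = {"rain": 3, "snow": 6, "clear": -2.7}

def prob_X_in(B):
    """P(X in B): sum P(s) over the outcomes s with X(s) in B.

    B is given as a predicate, i.e., a function returning True when x is in B.
    """
    return sum((P[s] for s in P if B(X[s])), Fraction(0))

p_3_or_6 = prob_X_in(lambda x: x in {3, 6})   # P(X in {3, 6})
p_less_5 = prob_X_in(lambda x: x < 5)         # P(X < 5)
p_17 = prob_X_in(lambda x: x == 17)           # P(X = 17)
print(float(p_3_or_6), float(p_less_5), float(p_17))
```

Any other set of real numbers can be handled the same way by changing the predicate.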


We see from this example that, if $B$ is any subset of the real numbers, then $P(X \in
B) = P(\{s \in S : X(s) \in B\})$. Furthermore, to understand $X$ well requires knowing
the probabilities $P(X \in B)$ for different subsets $B$. That is the motivation for the
following definition.


Definition 2.2.1 If $X$ is a random variable, then the distribution of $X$ is the
collection of probabilities $P(X \in B)$ for all subsets $B$ of the real numbers.


Strictly speaking, it is required that B be a Borel subset, which is a technical restriction 
from measure theory that need not concern us here. Any subset that we could ever 
write down is a Borel subset. 

In Figure 2.2.1, we provide a graphical representation of how we compute the
distribution of a random variable $X$. For a set $B$, we must find the elements $s \in S$ such
that $X(s) \in B$. These elements are given by the set $\{s \in S : X(s) \in B\}$. Then we
evaluate the probability $P(\{s \in S : X(s) \in B\})$. We must do this for every subset
$B \subset R^1$.


Figure 2.2.1: If $B = (a, b) \subset R^1$, then $\{s \in S : X(s) \in B\}$ is the set of elements such that
$a < X(s) < b$.


EXAMPLE 2.2.2 A Very Simple Distribution

Consider once again the above random variable, where S = {rain, snow, clear} and
where X is defined by X(rain) = 3, X(snow) = 6, and X(clear) = −2.7, and
P(rain) = 0.4, P(snow) = 0.15, and P(clear) = 0.45. What is the distribution of
$X$? Well, if $B$ is any subset of the real numbers, then $P(X \in B)$ should count 0.4 if
$3 \in B$, plus 0.15 if $6 \in B$, plus 0.45 if $-2.7 \in B$. We can formally write all this
information at once by saying that

$$P(X \in B) = 0.4\, I_B(3) + 0.15\, I_B(6) + 0.45\, I_B(-2.7),$$
where again $I_B(x) = 1$ if $x \in B$, and $I_B(x) = 0$ if $x \notin B$. ∎


EXAMPLE 2.2.3 An Almost-As-Simple Distribution 
Consider once again the above setting, with S = {rain, snow, clear}, and P(rain) = 0.4, 
P(snow) = 0.15, and P(clear) = 0.45. Consider a random variable Y defined by 
Y (rain) = 5, Y (snow) = 7, and Y (clear) = 5. 

What is the distribution of Y? Clearly, Y = 7 only when it snows, so that P(Y = 
7) = P(snow) = 0.15. However, here Y = 5 if it rains or if it is clear. Hence, 
P(Y = 5) = P({rain, clear}) = 0.4 + 0.45 = 0.85. Therefore, if B is any subset of 
the real numbers, then 


$$P(Y \in B) = 0.15\, I_B(7) + 0.85\, I_B(5).$$ ∎
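When several outcomes share the same value of a random variable, their probabilities are pooled, as happened for $Y = 5$ above. The following Python sketch tabulates $P(Y = y)$ by grouping outcomes on their $Y$ value. The function name `point_probabilities` is hypothetical, and the dictionaries are our own encoding of Example 2.2.3.

```python
from collections import defaultdict
from fractions import Fraction

# Probability measure P and random variable Y of Example 2.2.3.
P = {"rain": Fraction(2, 5), "snow": Fraction(3, 20), "clear": Fraction(9, 20)}
Y = {"rain": 5, "snow": 7, "clear": 5}

def point_probabilities(P, rv):
    """Return {y: P(rv = y)} by adding P(s) over all outcomes s with rv(s) = y."""
    table = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for s, p in P.items():
        table[rv[s]] += p
    return dict(table)

p_Y = point_probabilities(P, Y)
print({y: float(p) for y, p in p_Y.items()})
```

The resulting table carries the same information as the indicator formula for $P(Y \in B)$: to get $P(Y \in B)$, add the entries whose key lies in $B$.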


While the above examples show that it is possible to keep track of $P(X \in B)$ for all
subsets B of the real numbers, they also indicate that it is rather cumbersome to do so. 
Fortunately, there are simpler functions available to help us keep track of probability 
distributions, including cumulative distribution functions, probability functions, and 
density functions. We discuss these next. 


Summary of Section 2.2 


• The distribution of a random variable $X$ is the collection of probabilities $P(X \in
B)$ of $X$ belonging to various sets.

• The probability $P(X \in B)$ is determined by calculating the probability of the set
of response values $s$ such that $X(s) \in B$, i.e., $P(X \in B) = P(\{s \in S : X(s) \in
B\})$.




EXERCISES 


2.2.1 Consider flipping two independent fair coins. Let X be the number of heads that 
appear. Compute P(X = x) for all real numbers x. 

2.2.2 Suppose we flip three fair coins, and let X be the number of heads showing. 

(a) Compute P(X = x) for every real number x. 

(b) Write a formula for P(X € B), for any subset B of the real numbers. 

2.2.3 Suppose we roll two fair six-sided dice, and let Y be the sum of the two numbers 
showing. 

(a) Compute P(Y = y) for every real number y. 

(b) Write a formula for P(Y € B), for any subset B of the real numbers. 

2.2.4 Suppose we roll one fair six-sided die, and let Z be the number showing. Let 
W = Z? +4, and let V = JZ. 

(a) Compute P(W = w) for every real number w. 

(b) Compute P(V = v) for every real number v. 

(c) Compute P(ZW = x) for every real number x. 

(d) Compute P(V W = y) for every real number y. 

(e) Compute P(V + W =r) for every real number r. 

2.2.5 Suppose that a bowl contains 100 chips: 30 are labelled 1, 20 are labelled 2, and 
50 are labelled 3. The chips are thoroughly mixed, a chip is drawn, and the number X 
on the chip is noted. 

(a) Compute P(X = x) for every real number x. 

(b) Suppose the first chip is replaced, a second chip is drawn, and the number Y on the 
chip noted. Compute P(Y = y) for every real number y. 

(c) Compute P(W = w) for every real number w when W = X + Y. 

2.2.6 Suppose a standard deck of 52 playing cards is thoroughly shuffled and a single 
card is drawn. Suppose an ace has value 1, a jack has value 11, a queen has value 12, 
and a king has value 13. 

(a) Compute P(X = x) for every real number x, when X is the value of the card 
drawn. 

(b) Suppose that Y = 1,2,3, or 4 when a diamond, heart, club, or spade is drawn. 
Compute P(Y = y) for every real number y. 

(c) Compute P(W = w) for every real number w when W = X + Y. 

2.2.7 Suppose a university is composed of 55% female students and 45% male stu- 
dents. A student is selected to complete a questionnaire. There are 25 questions on 
the questionnaire administered to a male student and 30 questions on the questionnaire 
administered to a female student. If X denotes the number of questions answered by a 
randomly selected student, then compute P(X = x) for every real number x. 

2.2.8 Suppose that a bowl contains 10 chips, each uniquely numbered 0 through 9.
The chips are thoroughly mixed, one is drawn and the number on it, $X_1$, is noted. This
chip is then replaced in the bowl. A second chip is drawn and the number on it, $X_2$, is
noted. Compute $P(W = w)$ for every real number $w$ when $W = X_1 + 10X_2$.




PROBLEMS 


2.2.9 Suppose that a bowl contains 10 chips each uniquely numbered 0 through 9. The 
chips are thoroughly mixed, one is drawn and the number on it, X1, is noted. This chip 
is not replaced in the bowl. A second chip is drawn and the number on it, X2, is noted. 
Compute P(W = w) for every real number w when W = X1 + 10X2. 


CHALLENGES 


2.2.10 Suppose Alice flips three fair coins, and let X be the number of heads showing. 
Suppose Barbara flips five fair coins, and let Y be the number of heads showing. Let 
Z = X — Y. Compute P(Z = z) for every real number z. 


2.3 Discrete Distributions


For many random variables X, we have P(X = x) > 0 for certain x values. This 
means there is positive probability that the variable will be equal to certain particular 
values. 
If
$$\sum_{x \in R^1} P(X = x) = 1,$$


then all of the probability associated with the random variable X can be found from the 
probability that X will be equal to certain particular values. This prompts the following 
definition. 


Definition 2.3.1 A random variable $X$ is discrete if

$$\sum_{x \in R^1} P(X = x) = 1. \tag{2.3.1}$$


At first glance one might expect (2.3.1) to be true for any random variable.
However, (2.3.1) does not hold for the uniform distribution on [0, 1] or for other continuous
distributions, as we shall see in the next section.

Random variables satisfying (2.3.1) are simple in some sense because we can
understand them completely just by understanding their probabilities of being equal to
particular values $x$. Indeed, by simply listing out all the possible values $x$ such that
$P(X = x) > 0$, we obtain a second, equivalent definition, as follows.


Definition 2.3.2 A random variable $X$ is discrete if there is a finite or countable
sequence $x_1, x_2, \ldots$ of distinct real numbers, and a corresponding sequence $p_1, p_2, \ldots$
of nonnegative real numbers, such that $P(X = x_i) = p_i$ for all $i$, and $\sum_i p_i = 1$.


This second definition also suggests how to keep track of discrete distributions. It 
prompts the following definition. 




Definition 2.3.3 For a discrete random variable $X$, its probability function is the
function $p_X : R^1 \to [0, 1]$ defined by

$$p_X(x) = P(X = x).$$

Hence, if $x_1, x_2, \ldots$ are the distinct values such that $P(X = x_i) = p_i$ for all $i$ with
$\sum_i p_i = 1$, then

$$p_X(x) = \begin{cases} p_i & x = x_i \text{ for some } i \\ 0 & \text{otherwise.} \end{cases}$$


Clearly, all the information about the distribution of X is contained in its probability 
function, but only if we know that X is a discrete random variable. 
Finally, we note that Theorem 1.5.1 immediately implies the following. 


Theorem 2.3.1 (Law of total probability, discrete random variable version) Let $X$
be a discrete random variable, and let $A$ be some event. Then

$$P(A) = \sum_{x \in R^1} P(X = x)\, P(A \mid X = x).$$
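Theorem 2.3.1 can be checked on a small example. The sketch below is only an illustration under our own assumptions: a fair six-sided die, the hypothetical random variable $X(s) = s \bmod 2$ (the parity of the roll), and the hypothetical event $A = \{4, 5, 6\}$. It computes the right-hand side of the theorem term by term, so that it can be compared with $P(A)$.

```python
from fractions import Fraction

S = range(1, 7)                          # one roll of a fair six-sided die
P = {s: Fraction(1, 6) for s in S}

def X(s):
    """Hypothetical discrete random variable: parity of the roll."""
    return s % 2

A = {4, 5, 6}                            # hypothetical event

def prob(event):
    """P(event) by adding the probabilities of its outcomes."""
    return sum((P[s] for s in event), Fraction(0))

# Right-hand side of Theorem 2.3.1: sum over x of P(X = x) P(A | X = x).
# Values x with P(X = x) = 0 contribute nothing, so only attained values are visited.
total = Fraction(0)
for x in {X(s) for s in S}:
    level = {s for s in S if X(s) == x}  # the event {X = x}
    p_x = prob(level)                    # P(X = x)
    cond = prob(A & level) / p_x         # P(A | X = x)
    total += p_x * cond

print(total, prob(A))
```

In this example both sides equal 1/2.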


2.3.1 Important Discrete Distributions


Certain particular discrete distributions are so important that we list them here. 


EXAMPLE 2.3.1 Degenerate Distributions 

Let c be some fixed real number. Then, as already discussed, c is also a random variable 
(in fact, c is a constant random variable). In this case, clearly c is discrete, with 
probability function $p_c$ satisfying that $p_c(c) = 1$, and $p_c(x) = 0$ for $x \ne c$. Because $c$
is always equal to a particular value (namely, c) with probability 1, the distribution of 
c is sometimes called a point mass or point distribution or degenerate distribution. 


EXAMPLE 2.3.2 The Bernoulli Distribution 

Consider flipping a coin that has probability $\theta$ of coming up heads and probability $1 - \theta$
of coming up tails, where $0 < \theta < 1$. Let $X = 1$ if the coin is heads, while $X = 0$ if
the coin is tails. Then $p_X(1) = P(X = 1) = \theta$, while $p_X(0) = P(X = 0) = 1 - \theta$.
The random variable $X$ is said to have the Bernoulli$(\theta)$ distribution; we write this as
$X \sim$ Bernoulli$(\theta)$.

Bernoulli distributions arise anytime we have a response variable that takes only
two possible values, and we label one of these outcomes as 1 and the other as 0. For
example, 1 could correspond to success and 0 to failure of some quality test applied to
an item produced in a manufacturing process. In this case, $\theta$ is the proportion of
manufactured items that will pass the test. Alternatively, we could be randomly selecting an
individual from a population and recording a 1 when the individual is female and a 0 if
the individual is a male. In this case, $\theta$ is the proportion of females in the population. ∎




EXAMPLE 2.3.3 The Binomial Distribution 

Consider flipping $n$ coins, each of which has (independent) probability $\theta$ of coming up
heads, and probability $1 - \theta$ of coming up tails. (Again, $0 < \theta < 1$.) Let $X$ be the total
number of heads showing. By (1.4.2), we see that for $x = 0, 1, 2, \ldots, n$,

$$p_X(x) = P(X = x) = \binom{n}{x} \theta^x (1 - \theta)^{n-x} = \frac{n!}{x!\,(n - x)!}\, \theta^x (1 - \theta)^{n-x}.$$

The random variable $X$ is said to have the Binomial$(n, \theta)$ distribution; we write this as
$X \sim$ Binomial$(n, \theta)$. The Bernoulli$(\theta)$ distribution corresponds to the special case of
the Binomial$(n, \theta)$ distribution when $n = 1$, namely, Bernoulli$(\theta)$ = Binomial$(1, \theta)$.
Figure 2.3.1 contains the plots of several Binomial$(20, \theta)$ probability functions.


Figure 2.3.1: Plot of the Binomial(20, 1/2) (• • •) and the Binomial(20, 1/5) (◦ ◦ ◦)
probability functions.


The binomial distribution is applicable to any situation involving $n$ independent
performances of a random system; for each performance, we are recording whether a
particular event has occurred, called a success, or has not occurred, called a failure. If
we denote the event in question by $A$ and put $\theta = P(A)$, we have that the number of
successes in the $n$ performances is distributed Binomial$(n, \theta)$. For example, we could
be testing light bulbs produced by a manufacturer, and $\theta$ is the probability that a bulb
works when we test it. Then the number of bulbs that work in a batch of $n$ is distributed
Binomial$(n, \theta)$. If a baseball player has probability $\theta$ of getting a hit when at bat, then
the number of hits obtained in $n$ at-bats is distributed Binomial$(n, \theta)$.

There is another way of expressing the binomial distribution that is sometimes
useful. For example, if $X_1, X_2, \ldots, X_n$ are chosen independently and each has the
Bernoulli$(\theta)$ distribution, and $Y = X_1 + \cdots + X_n$, then $Y$ will have the Binomial$(n, \theta)$
distribution (see Example 3.4.10 for the details). ∎
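The binomial probability function, and its description as a sum of independent Bernoulli variables, can both be checked by brute force for small $n$. The following Python sketch does so with exact fractions. The function name `binomial_pmf` and the choice $n = 4$, $\theta = 1/3$ are our own illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
from math import comb

def binomial_pmf(x, n, theta):
    """Binomial(n, theta) probability function evaluated at x."""
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

n, theta = 4, Fraction(1, 3)  # illustrative values only
pmf = [binomial_pmf(x, n, theta) for x in range(n + 1)]

# Enumerate all 2^n sequences of independent Bernoulli(theta) flips and
# accumulate the probability of each sequence under its number of heads.
by_enumeration = [Fraction(0)] * (n + 1)
for flips in product((0, 1), repeat=n):
    heads = sum(flips)
    by_enumeration[heads] += theta ** heads * (1 - theta) ** (n - heads)

print(pmf)
print(sum(pmf))
```

The two lists agree because each sequence with $x$ heads has probability $\theta^x (1 - \theta)^{n-x}$ and there are $\binom{n}{x}$ such sequences.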


EXAMPLE 2.3.4 The Geometric Distribution 
Consider repeatedly flipping a coin that has probability $\theta$ of coming up heads and
probability $1 - \theta$ of coming up tails, where again $0 < \theta < 1$. Let $X$ be the number of tails
that appear before the first head. Then for $k \ge 0$, $X = k$ if and only if the coin shows
exactly $k$ tails followed by a head. The probability of this is equal to $(1 - \theta)^k \theta$. (In
particular, the probability of getting an infinite number of tails before the first head is
equal to $(1 - \theta)^{\infty} \theta = 0$, so $X$ is never equal to infinity.) Hence, $p_X(k) = (1 - \theta)^k \theta$,
for $k = 0, 1, 2, 3, \ldots$. The random variable $X$ is said to have the Geometric$(\theta)$
distribution; we write this as $X \sim$ Geometric$(\theta)$. Figure 2.3.2 contains the plots of several
Geometric$(\theta)$ probability functions.


Figure 2.3.2: Plot of the Geometric(1/2) (• • •) and the Geometric(1/5) (◦ ◦ ◦) probability
functions at the values 0, 1, ..., 15.


The geometric distribution applies whenever we are counting the number of failures 
until the first success for independent performances of a random system where the 
occurrence of some event is considered a success. For example, the number of light 
bulbs tested that work until the first bulb that does not (a working bulb is considered a 
“failure” for the test) and the number of at-bats without a hit until the first hit for the 
baseball player both follow the geometric distribution. 

We note that some books instead define the geometric distribution to be the number 
of coin flips up to and including the first head, which is simply equal to one plus the 
random variable defined here. ∎
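The geometric probabilities form a geometric series, so their partial sums have the closed form $\sum_{k=0}^{m-1} (1 - \theta)^k \theta = 1 - (1 - \theta)^m$, which tends to 1 as $m \to \infty$. The sketch below checks this exactly. The function name `geometric_pmf` and the values $\theta = 1/5$, $m = 50$ are our own illustrative choices.

```python
from fractions import Fraction

def geometric_pmf(k, theta):
    """Geometric(theta) probability function: k tails before the first head."""
    return (1 - theta) ** k * theta

theta = Fraction(1, 5)  # illustrative value only
m = 50

# Probability that the first head appears within the first m flips.
partial = sum(geometric_pmf(k, theta) for k in range(m))
print(float(partial))
```

The remaining probability, $(1 - \theta)^m$, is the chance that the first $m$ flips are all tails.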


EXAMPLE 2.3.5 The Negative-Binomial Distribution 

Generalizing the previous example, consider again repeatedly flipping a coin that has 
probability $\theta$ of coming up heads and probability $1 - \theta$ of coming up tails. Let $r$ be a
positive integer, and let $Y$ be the number of tails that appear before the $r$th head. Then
for $k \ge 0$, $Y = k$ if and only if the coin shows exactly $r - 1$ heads (and $k$ tails) on the
first $r - 1 + k$ flips, and then shows a head on the $(r + k)$-th flip. The probability of
this is equal to

$$p_Y(k) = \binom{r - 1 + k}{r - 1} \theta^{r-1} (1 - \theta)^k\, \theta = \binom{r - 1 + k}{k} \theta^r (1 - \theta)^k,$$

for $k = 0, 1, 2, 3, \ldots$.

The random variable $Y$ is said to have the Negative-Binomial$(r, \theta)$ distribution; we
write this as $Y \sim$ Negative-Binomial$(r, \theta)$. Of course, the special case $r = 1$
corresponds to the Geometric$(\theta)$ distribution. So in terms of our notation, we have that
Negative-Binomial$(1, \theta)$ = Geometric$(\theta)$. Figure 2.3.3 contains the plots of several
Negative-Binomial$(r, \theta)$ probability functions.


Figure 2.3.3: Plot of the Negative-Binomial(2, 1/2) (• • •) probability function and the
Negative-Binomial(10, 1/2) (◦ ◦ ◦) probability function at the values 0, 1, ..., 20.


The Negative-Binomial$(r, \theta)$ distribution applies whenever we are counting the
number of failures until the $r$th success for independent performances of a random
system where the occurrence of some event is considered a success. For example, the
number of light bulbs tested that work until the third bulb that does not and the
number of at-bats without a hit until the fifth hit for the baseball player both follow the
negative-binomial distribution. ∎
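The two forms of the negative-binomial probability function, and the reduction to the geometric case when $r = 1$, are easy to confirm numerically. The following Python sketch uses exact fractions. The function names and the test value $\theta = 1/4$ are our own assumptions.

```python
from fractions import Fraction
from math import comb

def negative_binomial_pmf(k, r, theta):
    """Negative-Binomial(r, theta) probability of k tails before the r-th head."""
    return comb(r - 1 + k, k) * theta ** r * (1 - theta) ** k

def geometric_pmf(k, theta):
    """Geometric(theta) probability of k tails before the first head."""
    return (1 - theta) ** k * theta

theta = Fraction(1, 4)  # illustrative value only

# r = 1 should reproduce the geometric probabilities.
r1_matches_geometric = all(
    negative_binomial_pmf(k, 1, theta) == geometric_pmf(k, theta) for k in range(10)
)

# For r = 2, k = 1 the qualifying sequences are HTH and THH.
p_two_heads_one_tail = negative_binomial_pmf(1, 2, theta)
print(r1_matches_geometric, p_two_heads_one_tail)
```

The binomial coefficient counts the arrangements of the first $r - 1 + k$ flips, which is why $\binom{r-1+k}{k}$ and $\binom{r-1+k}{r-1}$ give the same value.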


EXAMPLE 2.3.6 The Poisson Distribution 
We say that a random variable $Y$ has the Poisson$(\lambda)$ distribution, and write $Y \sim$
Poisson$(\lambda)$, if

$$p_Y(y) = P(Y = y) = \frac{\lambda^y}{y!}\, e^{-\lambda}$$
for $y = 0, 1, 2, 3, \ldots$. We note that since (from calculus) $\sum_{y=0}^{\infty} \lambda^y / y! = e^{\lambda}$, it is
indeed true (as it must be) that $\sum_{y=0}^{\infty} P(Y = y) = 1$. Figure 2.3.4 contains the plots
of several Poisson$(\lambda)$ probability functions.


Figure 2.3.4: Plot of the Poisson(2) (• • •) and the Poisson(10) (◦ ◦ ◦) probability functions at
the values 0, 1, ..., 20.


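We motivate the Poisson distribution as follows. Suppose $X \sim$ Binomial$(n, \theta)$,
i.e., $X$ has the Binomial$(n, \theta)$ distribution as in Example 2.3.3. Then for $0 \le x \le n$,

$$P(X = x) = \binom{n}{x} \theta^x (1 - \theta)^{n-x}.$$

If we set $\theta = \lambda/n$ for some $\lambda > 0$, then this becomes

$$
\begin{aligned}
P(X = x) &= \binom{n}{x} \left( \frac{\lambda}{n} \right)^x \left( 1 - \frac{\lambda}{n} \right)^{n-x} \\
&= \frac{n(n-1) \cdots (n-x+1)}{x!} \left( \frac{\lambda}{n} \right)^x \left( 1 - \frac{\lambda}{n} \right)^{n-x}.
\end{aligned}
\tag{2.3.2}
$$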

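Let us now consider what happens if we let $n \to \infty$ in (2.3.2), while keeping $x$
fixed at some nonnegative integer. In that case,

$$\frac{n(n-1)(n-2) \cdots (n-x+1)}{n^x} = 1 \left( 1 - \frac{1}{n} \right) \left( 1 - \frac{2}{n} \right) \cdots \left( 1 - \frac{x-1}{n} \right)$$

converges to 1 while (since from calculus $(1 + (c/n))^n \to e^c$ for any $c$)

$$\left( 1 - \frac{\lambda}{n} \right)^{n-x} = \left( 1 - \frac{\lambda}{n} \right)^n \left( 1 - \frac{\lambda}{n} \right)^{-x} \to e^{-\lambda} \cdot 1 = e^{-\lambda}.$$

Substituting these limits into (2.3.2), we see that

$$\lim_{n \to \infty} P(X = x) = \frac{\lambda^x}{x!}\, e^{-\lambda}$$

for $x = 0, 1, 2, 3, \ldots$.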

Intuitively, we can phrase this result as follows. If we flip a very large number
of coins n, and each coin has a very small probability $\theta = \lambda/n$ of coming up heads,
then the probability that the total number of heads will be x is approximately given
by $\lambda^x e^{-\lambda}/x!$. Figure 2.3.5 displays the accuracy of this estimate when we are
approximating the Binomial(100, 1/10) distribution by the Poisson($\lambda$) distribution where
$\lambda = n\theta = 100(1/10) = 10$.
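The approximation can also be checked numerically. The following is a minimal sketch, not part of the original text: the helper names `binomial_pmf` and `poisson_pmf` are illustrative choices, and the code simply evaluates the two probability functions defined above at the values 0, 1, ..., 20.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, theta):
    """P(X = x) for X ~ Binomial(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def poisson_pmf(x, lam):
    """P(Y = x) for Y ~ Poisson(lam)."""
    return lam**x * exp(-lam) / factorial(x)

n, theta = 100, 1 / 10
lam = n * theta  # lambda = n * theta = 10

# Largest pointwise gap between the two probability functions on 0..20.
max_gap = max(abs(binomial_pmf(x, n, theta) - poisson_pmf(x, lam))
              for x in range(21))

for x in (0, 5, 10, 15, 20):
    print(x, round(binomial_pmf(x, n, theta), 5), round(poisson_pmf(x, lam), 5))
print("largest gap on 0..20:", max_gap)
```

Rerunning with a larger n and the same product n * theta (for example n = 1000, theta = 1/100) should shrink the gap, in line with the limit derived above.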


Figure 2.3.5: Plot of the Binomial(100, 1/10) (• • •) and the Poisson(10) (○ ○ ○) probability
functions at the values 0, 1, ..., 20.




The Poisson distribution is a good model for counting random occurrences of an 
event when there are many possible occurrences, but each occurrence has very small 
probability. Examples include the number of house fires in a city on a given day, the 
number of radioactive events recorded by a Geiger counter, the number of phone calls 
arriving at a switchboard, the number of hits on a popular World Wide Web page on a 
given day, etc. ■


EXAMPLE 2.3.7 The Hypergeometric Distribution 
Suppose that an urn contains $M$ white balls and $N - M$ black balls. Suppose we
draw $n \le N$ balls from the urn in such a fashion that each subset of $n$ balls has the
same probability of being drawn. Because there are $\binom{N}{n}$ such subsets, this probability
is $1/\binom{N}{n}$.

One way of accomplishing this is to thoroughly mix the balls in the urn and then 
draw a first ball. Accordingly, each ball has probability 1/N of being drawn. Then, 
without replacing the first ball, we thoroughly mix the balls in the urn and draw a 
second ball. So each ball in the urn has probability $1/(N - 1)$ of being drawn. We then
have that any two balls, say the $i$th and $j$th balls, have probability

$$\begin{aligned}
P(\text{balls } i \text{ and } j \text{ are drawn})
&= P(\text{ball } i \text{ is drawn first})\, P(\text{ball } j \text{ is drawn second} \mid \text{ball } i \text{ is drawn first}) \\
&\quad + P(\text{ball } j \text{ is drawn first})\, P(\text{ball } i \text{ is drawn second} \mid \text{ball } j \text{ is drawn first}) \\
&= \frac{1}{N}\,\frac{1}{N-1} + \frac{1}{N}\,\frac{1}{N-1} = 1 \Big/ \binom{N}{2}
\end{aligned}$$

of being drawn in the first two draws. Continuing in this fashion for n draws, we obtain
that the probability of any particular set of n balls being drawn is $1/\binom{N}{n}$. This type of
sampling is called sampling without replacement.

Given that we take a sample of n, let X denote the number of white balls obtained.
Note that we must have $X \ge 0$ and $X \ge n - (N - M)$ because at most $N - M$ of
the balls could be black. Hence, $X \ge \max(0, n + M - N)$. Furthermore, $X \le n$ and
$X \le M$ because there are only M white balls. Hence, $X \le \min(n, M)$.

So suppose $\max(0, n + M - N) \le x \le \min(n, M)$. What is the probability that
x white balls are obtained? In other words, what is P(X = x)? To evaluate this, we
know that we need to count the number of subsets of n balls that contain x white balls.
Using the combinatorial principles of Section 1.4.1, we see that this number is given
by $\binom{M}{x}\binom{N-M}{n-x}$. Therefore,

$$P(X = x) = \binom{M}{x}\binom{N-M}{n-x} \Big/ \binom{N}{n}$$

for $\max(0, n + M - N) \le x \le \min(n, M)$. The random variable X is said to have the
Hypergeometric(N, M, n) distribution. In Figure 2.3.6, we have plotted some hypergeometric
probability functions. The Hypergeometric(20, 10, 10) probability function
is 0 for x > 10, while the Hypergeometric(20, 10, 5) probability function is 0 for
x > 5.
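As a quick check on the formula, here is a small sketch (the function name `hypergeometric_pmf` is an illustrative choice, not notation from the text) that evaluates the hypergeometric probability function with exact binomial coefficients and confirms that the probabilities over the allowed range add up to 1.

```python
from math import comb

def hypergeometric_pmf(x, N, M, n):
    """P(X = x) for X ~ Hypergeometric(N, M, n): x white balls in n draws."""
    lo, hi = max(0, n + M - N), min(n, M)
    if x < lo or x > hi:
        return 0.0  # outside the possible range of X
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

N, M, n = 20, 10, 5  # one of the cases plotted in Figure 2.3.6
pmf = [hypergeometric_pmf(x, N, M, n) for x in range(n + 1)]
print(pmf)
print(sum(pmf))
```

With M = N - M, as in this case, the probability function is symmetric about n/2, which the printed list makes visible.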



Figure 2.3.6: Plot of Hypergeometric(20, 10, 10) (• • •) and Hypergeometric(20, 10, 5)
(○ ○ ○) probability functions.


Obviously, the hypergeometric distribution will apply to any context wherein we 
are sampling without replacement from a finite set of N elements and where each el- 
ement of the set either has a characteristic or does not. For example, if we randomly 
select people to participate in an opinion poll so that each set of n individuals in a pop- 
ulation of N has the same probability of being selected, then the number of people who 
respond yes to a particular question is distributed Hypergeometric(N, M, n), where M
is the number of people in the entire population who would respond yes. We will see
the relevance of this to statistics in Section 5.4.2. ■


Suppose in Example 2.3.7 we had instead replaced the drawn ball before draw- 
ing the next ball. This is called sampling with replacement. It is then clear, from 
Example 2.3.3, that the number of white balls observed in n draws is distributed 
Binomial(n, M/N). 
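This claim can be verified by brute force for a small urn. The sketch below uses hypothetical values N = 5, M = 2, n = 3 chosen only for illustration: it enumerates all $N^n$ equally likely ordered draws with replacement, counts the white balls in each, and compares the resulting proportions with the Binomial(n, M/N) probability function.

```python
from itertools import product
from math import comb

N, M, n = 5, 2, 3   # hypothetical small urn: balls 0..N-1, the first M are white
theta = M / N

# Enumerate all N**n equally likely ordered draws with replacement.
counts = [0] * (n + 1)
for draw in product(range(N), repeat=n):
    whites = sum(1 for ball in draw if ball < M)
    counts[whites] += 1

enumerated = [c / N**n for c in counts]
binomial = [comb(n, x) * theta**x * (1 - theta)**(n - x) for x in range(n + 1)]
print(enumerated)
print(binomial)
```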


Summary of Section 2.3 


• A random variable X is discrete if $\sum_x P(X = x) = 1$, i.e., if all its probability
comes from being equal to particular values.

• A discrete random variable X takes on only a finite, or countable, number of
distinct values.

• Important discrete distributions include the degenerate, Bernoulli, binomial, geometric,
negative-binomial, Poisson, and hypergeometric distributions.


EXERCISES 


2.3.1 Consider rolling two fair six-sided dice. Let Y be the sum of the numbers show- 
ing. What is the probability function of Y? 

2.3.2 Consider flipping a fair coin. Let Z = 1 if the coin is heads, and Z = 3 if the
coin is tails. Let $W = Z^2 + Z$.

(a) What is the probability function of Z? 

(b) What is the probability function of W? 




2.3.3 Consider flipping two fair coins. Let X = 1 if the first coin is heads, and X = 0 
if the first coin is tails. Let Y = 1 if the second coin is heads, and Y = 5 if the second 
coin is tails. Let Z = XY. What is the probability function of Z? 


2.3.4 Consider flipping two fair coins. Let X = 1 if the first coin is heads, and X = 0 
if the first coin is tails. Let Y = 1 if the two coins show the same thing (i.e., both heads 
or both tails), with Y = 0 otherwise. Let Z = X + Y, and W = XY. 

(a) What is the probability function of Z? 

(b) What is the probability function of W? 


2.3.5 Consider rolling two fair six-sided dice. Let W be the product of the numbers 
showing. What is the probability function of W? 

2.3.6 Let Z ~ Geometric($\theta$). Compute P(5 < Z < 9).

2.3.7 Let X ~ Binomial(12, $\theta$). For what value of $\theta$ is P(X = 11) maximized?

2.3.8 Let W ~ Poisson($\lambda$). For what value of $\lambda$ is P(W = 11) maximized?

2.3.9 Let Z ~ Negative-Binomial(3, 1/4). Compute P(Z < 2).

2.3.10 Let X ~ Geometric(1/5). Compute $P(X^2 < 15)$.

2.3.11 Let Y ~ Binomial(10, $\theta$). Compute P(Y = 10).

2.3.12 Let X ~ Poisson($\lambda$). Let Y = X - 7. What is the probability function of Y?
2.3.13 Let X ~ Hypergeometric(20, 7, 8). What is the probability that X = 3? What 
is the probability that X = 8? 

2.3.14 Suppose that a symmetrical die is rolled 20 independent times, and each time 
we record whether or not the event {2, 3, 5, 6} has occurred. 

(a) What is the distribution of the number of times this event occurs in 20 rolls? 

(b) Calculate the probability that the event occurs five times. 

2.3.15 Suppose that a basketball player sinks a basket from a certain position on the 
court with probability 0.35. 

(a) What is the probability that the player sinks three baskets in 10 independent throws? 
(b) What is the probability that the player throws 10 times before obtaining the first 
basket? 

(c) What is the probability that the player throws 10 times before obtaining two baskets? 
2.3.16 An urn contains 4 black balls and 5 white balls. After a thorough mixing, a ball 
is drawn from the urn, its color is noted, and the ball is returned to the urn. 

(a) What is the probability that 5 black balls are observed in 15 such draws? 

(b) What is the probability that 15 draws are required until the first black ball is ob- 
served? 

(c) What is the probability that 15 draws are required until the fifth black ball is ob- 
served? 

2.3.17 An urn contains 4 black balls and 5 white balls. After a thorough mixing, a ball 
is drawn from the urn, its color is noted, and the ball is set aside. The remaining balls 
are then mixed and a second ball is drawn. 

(a) What is the probability distribution of the number of black balls observed? 

(b) What is the probability distribution of the number of white balls observed? 




2.3.18 (Poisson processes and queues) Consider a situation involving a server, e.g., 
a cashier at a fast-food restaurant, an automatic bank teller machine, a telephone ex- 
change, etc. Units typically arrive for service in a random fashion and form a queue 
when the server is busy. It is often the case that the number of arrivals at the server, for 
some specific unit of time t, can be modeled by a Poisson($\lambda t$) distribution and is such
that the number of arrivals in nonoverlapping periods are independent. In Chapter 3,
we will show that $\lambda t$ is the average number of arrivals during a time period of length t,
and so $\lambda$ is the rate of arrivals per unit of time.

Suppose telephone calls arrive at a help line at the rate of two per minute. A Poisson 
process provides a good model. 
(a) What is the probability that five calls arrive in the next 2 minutes? 
(b) What is the probability that five calls arrive in the next 2 minutes and then five more 
calls arrive in the following 2 minutes? 
(c) What is the probability that no calls will arrive during a 10-minute period? 
2.3.19 Suppose an urn contains 1000 balls — one of these is black, and the other 999 
are white. Suppose that 100 balls are randomly drawn from the urn with replacement. 
Use the appropriate Poisson distribution to approximate the probability that five black 
balls are observed. 
2.3.20 Suppose that there is a loop in a computer program and that the test to exit 
the loop depends on the value of a random variable X. The program exits the loop 
whenever $X \in A$, and this occurs with probability 1/3. If the loop is executed at least
once, what is the probability that the loop is executed five times before exiting? 


COMPUTER EXERCISES 


2.3.21 Tabulate and plot the Hypergeometric(20, 8, 10) probability function. 

2.3.22 Tabulate and plot the Binomial(30, 0.3) probability function. Tabulate and plot
the Binomial(30, 0.7) probability function. Explain why the Binomial(30, 0.3) probability
function at x agrees with the Binomial(30, 0.7) probability function at n - x.


PROBLEMS 


2.3.23 Let X be a discrete random variable with probability function $p_X(x) = 2^{-x}$ for
$x = 1, 2, 3, \ldots$, with $p_X(x) = 0$ otherwise.

(a) Let $Y = X^2$. What is the probability function $p_Y$ of Y?

(b) Let $Z = X - 1$. What is the distribution of Z? (Identify the distribution by name
and specify all parameter values.)

2.3.24 Let X ~ Binomial($n_1, \theta$) and Y ~ Binomial($n_2, \theta$), with X and Y chosen
independently. Let Z = X + Y. What will be the distribution of Z? (Explain your
reasoning.) (Hint: See the end of Example 2.3.3.)

2.3.25 Let X ~ Geometric($\theta$) and Y ~ Geometric($\theta$), with X and Y chosen independently.
Let Z = X + Y. What will be the distribution of Z? Generalize this to r coins.
(Explain your reasoning.)




2.3.26 Let X ~ Geometric($\theta_1$) and Y ~ Geometric($\theta_2$), with X and Y chosen
independently. Compute P(X < Y). Explain what this probability is in terms of coin
tossing.

2.3.27 Suppose that X ~ Geometric($\lambda/n$). Compute $\lim_{n \to \infty} P(X < n)$.

2.3.28 Let X ~ Negative-Binomial($r, \theta$) and Y ~ Negative-Binomial($s, \theta$), with X
and Y chosen independently. Let Z = X + Y. What will be the distribution of Z?
(Explain your reasoning.)

2.3.29 (Generalized hypergeometric distribution) Suppose that a set contains N objects,
$M_1$ of which are labelled 1, $M_2$ of which are labelled 2, and the remainder of
which are labelled 3. Suppose we select a sample of $n \le N$ objects from the set using
sampling without replacement, as described in Example 2.3.7. Determine the probability
that we obtain the counts $(f_1, f_2, f_3)$ where $f_i$ is the number of objects labelled
i in the sample.

2.3.30 Suppose that units arrive at a server according to a Poisson process at rate $\lambda$ (see
Exercise 2.3.18). Let T be the amount of time until the first call. Calculate P(T > t).


2.4 | Continuous Distributions 


In the previous section, we considered discrete random variables X for which P(X = 
x) > 0 for certain values of x. However, for some random variables X, such as one 
having the uniform distribution, we have P(X = x) = 0 for all x. This prompts the 
following definition. 


Definition 2.4.1 A random variable X is continuous if

$$P(X = x) = 0$$

for all $x \in R^1$.


EXAMPLE 2.4.1 The Uniform[0, 1] Distribution 
Consider a random variable whose distribution is the uniform distribution on [0, 1], as 
presented in (1.2.2). That is, 


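$$P(a \le X \le b) = b - a, \qquad (2.4.2)$$

whenever $0 \le a \le b \le 1$, with $P(X < 0) = P(X > 1) = 0$. The random variable X
is said to have the Uniform[0, 1] distribution; we write this as X ~ Uniform[0, 1]. For
example,

$$P\left(\tfrac{1}{2} \le X \le \tfrac{3}{4}\right) = \tfrac{3}{4} - \tfrac{1}{2} = \tfrac{1}{4}.$$

Also,

$$P\left(X \ge \tfrac{1}{2}\right) = P\left(\tfrac{1}{2} \le X \le 1\right) + P(X > 1) = \left(1 - \tfrac{1}{2}\right) + 0 = \tfrac{1}{2}.$$

In fact, for any $x \in [0, 1]$,

$$P(X \le x) = P(X < 0) + P(0 \le X \le x) = 0 + (x - 0) = x.$$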




Note that setting a = b = x in (2.4.2), we see in particular that $P(X = x) = x - x = 0$
for every $x \in R^1$. Thus, the uniform distribution is an example of a continuous
distribution. In fact, it is one of the most important examples! ■
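The computations in this example reduce to measuring how much of an interval lies inside [0, 1]. A minimal sketch follows; the function name `uniform01_prob` is an illustrative choice, and the clipping step relies only on the stated facts that P(X < 0) = P(X > 1) = 0.

```python
def uniform01_prob(a, b):
    """P(a <= X <= b) for X ~ Uniform[0, 1], for any real a <= b."""
    # Clip the interval to [0, 1], where all the probability lives.
    lo, hi = max(a, 0.0), min(b, 1.0)
    return max(hi - lo, 0.0)

print(uniform01_prob(0.5, 0.75))  # the first worked example above
print(uniform01_prob(0.5, 2.0))   # P(X >= 1/2)
print(uniform01_prob(0.3, 0.3))   # a single point
```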


The Uniform[0, 1] distribution is fairly easy to work with. However, in general, 
continuous distributions are very difficult to work with. Because P(X = x) = 0 for 
all x, we cannot simply add up probabilities as we can for discrete random variables. 
Thus, how can we keep track of all the probabilities? 

A possible solution is suggested by rewriting (2.4.2), as follows. For $x \in R^1$, let

$$f(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases} \qquad (2.4.3)$$

Then (2.4.2) can be rewritten as

$$P(a \le X \le b) = \int_a^b f(x)\,dx, \qquad (2.4.4)$$

whenever $a \le b$.

One might wonder about the wisdom of converting the simple equation (2.4.2) into 
the complicated integral equation (2.4.4). However, the advantage of (2.4.4) is that, by 
modifying the function f, we can obtain many other continuous distributions besides 
the uniform distribution. To explore this, we make the following definitions. 


Definition 2.4.2 Let $f : R^1 \to R^1$ be a function. Then f is a density function if
$f(x) \ge 0$ for all $x \in R^1$, and $\int_{-\infty}^{\infty} f(x)\,dx = 1$.


Definition 2.4.3 A random variable X is absolutely continuous if there is a density
function f, such that

$$P(a \le X \le b) = \int_a^b f(x)\,dx, \qquad (2.4.5)$$

whenever $a \le b$, as in (2.4.4).


In particular, if $b = a + \delta$, with $\delta$ a small positive number, and if f is continuous at a,
then we see that

$$P(a \le X \le a + \delta) = \int_a^{a+\delta} f(x)\,dx \approx \delta f(a).$$


Thus, a density function evaluated at a may be thought of as measuring the probability 
of a random variable being in a small interval about a. 
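To see how good this approximation is, the sketch below compares the exact interval probability with $\delta f(a)$ for shrinking $\delta$. It assumes, purely for illustration, the density $e^{-x}$ on $x \ge 0$ (introduced later as the Exponential(1) density in Example 2.4.4), for which the exact probability is $e^{-a} - e^{-b}$; the function names are hypothetical.

```python
from math import exp

def f(x):
    """Density used for illustration: e^{-x} for x >= 0, and 0 otherwise."""
    return exp(-x) if x >= 0 else 0.0

def exact_prob(a, b):
    """P(a <= X <= b) = e^{-a} - e^{-b}, valid for 0 <= a <= b."""
    return exp(-a) - exp(-b)

a = 1.0
for delta in (0.1, 0.01, 0.001):
    exact = exact_prob(a, a + delta)
    approx = delta * f(a)
    # Relative error of the approximation delta * f(a).
    print(delta, exact, approx, abs(exact - approx) / exact)
```

The relative error should shrink roughly in proportion to $\delta$, which is what one expects when f is continuous (here, differentiable) at a.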

To better understand absolutely continuous random variables, we note the following 
theorem. 


Theorem 2.4.1 Let X be an absolutely continuous random variable. Then X is a
continuous random variable, i.e., $P(X = a) = 0$ for all $a \in R^1$.




PROOF Let a be any real number. Then $P(X = a) = P(a \le X \le a)$. On the
other hand, setting a = b in (2.4.5), we see that $P(a \le X \le a) = \int_a^a f(x)\,dx = 0$.
Hence, P(X = a) = 0 for all a, as required. ■


It turns out that the converse to Theorem 2.4.1 is false. That is, not all continuous 
distributions are absolutely continuous.¹ However, most of the continuous distributions
that arise in statistics are absolutely continuous. Furthermore, absolutely continuous 
distributions are much easier to work with than are other kinds of continuous distribu- 
tions. Hence, we restrict our discussion to absolutely continuous distributions here. In 
fact, statisticians sometimes say that X is continuous as shorthand for saying that X is 
absolutely continuous. 


2.4.1 | Important Absolutely Continuous Distributions 


Certain absolutely continuous distributions are so important that we list them here. 


EXAMPLE 2.4.2 The Uniform[0, 1] Distribution 

Clearly, the uniform distribution is absolutely continuous, with the density function 
given by (2.4.3). We will see, in Section 2.10, that the Uniform[0, 1] distribution has 
an important relationship with every absolutely continuous distribution. 


EXAMPLE 2.4.3 The Uniform[L , R] Distribution 
Let L and R be any two real numbers with L < R. Consider a random variable X such
that

$$P(a \le X \le b) = \frac{b - a}{R - L} \qquad (2.4.6)$$

whenever $L \le a \le b \le R$, with $P(X < L) = P(X > R) = 0$. The random variable
X is said to have the Uniform[L, R] distribution; we write this as X ~ Uniform[L, R].
(If L = 0 and R = 1, then this definition coincides with the previous definition of the
Uniform[0, 1] distribution.) Note that X ~ Uniform[L, R] has the same probability of
being in any two subintervals of [L, R] that have the same length.

Note that the Uniform[L, R] distribution is also absolutely continuous, with density
given by

$$f(x) = \begin{cases} \dfrac{1}{R - L} & L \le x \le R \\ 0 & \text{otherwise.} \end{cases}$$

In Figure 2.4.1 we have plotted a Uniform[2, 4] density. ■
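The link between (2.4.6) and the density can be checked numerically. This is a sketch only: the helper names are illustrative, and the midpoint rule is one simple way (chosen here as an assumption, not prescribed by the text) to approximate the integral of the density.

```python
def uniform_density(x, L, R):
    """Density of Uniform[L, R]: 1/(R - L) on [L, R], 0 elsewhere."""
    return 1.0 / (R - L) if L <= x <= R else 0.0

def uniform_prob(a, b, L, R):
    """P(a <= X <= b) from (2.4.6), assuming L <= a <= b <= R."""
    return (b - a) / (R - L)

def midpoint_integral(g, a, b, steps=10_000):
    """Midpoint-rule approximation to the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

L, R = 2.0, 4.0  # the case plotted in Figure 2.4.1
print(uniform_density(3.0, L, R))
print(uniform_prob(2.5, 3.5, L, R))
print(midpoint_integral(lambda x: uniform_density(x, L, R), 2.5, 3.5))
```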


¹ For examples of this, see more advanced probability books, e.g., page 143 of A First Look at Rigorous
Probability Theory, Second Edition, by J. S. Rosenthal (World Scientific Publishing, Singapore, 2006). 




Figure 2.4.1: A Uniform[2, 4] density function. 


EXAMPLE 2.4.4 The Exponential(1) Distribution 
Define a function $f : R^1 \to R^1$ by

$$f(x) = \begin{cases} e^{-x} & x \ge 0 \\ 0 & x < 0. \end{cases}$$

Then clearly, $f(x) \ge 0$ for all x. Also,

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_0^{\infty} e^{-x}\,dx = -e^{-x}\Big|_0^{\infty} = (-0) - (-1) = 1.$$

Hence, f is a density function. See Figure 2.4.2 for a plot of this density.
Consider now a random variable X having this density function f. If $0 \le a \le b < \infty$,
then

$$P(a \le X \le b) = \int_a^b f(x)\,dx = \int_a^b e^{-x}\,dx = (-e^{-b}) - (-e^{-a}) = e^{-a} - e^{-b}.$$

The random variable X is said to have the Exponential(1) distribution, which we write
as X ~ Exponential(1). The exponential distribution has many important properties,
which we will explore in the coming sections. ■


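EXAMPLE 2.4.5 The Exponential($\lambda$) Distribution
Let $\lambda > 0$ be a fixed constant. Define a function $f : R^1 \to R^1$ by

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0. \end{cases}$$

Then clearly, $f(x) \ge 0$ for all x. Also,

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_0^{\infty} \lambda e^{-\lambda x}\,dx = -e^{-\lambda x}\Big|_0^{\infty} = (-0) - (-1) = 1.$$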




Hence, f is again a density function. (If $\lambda = 1$, then this corresponds to the Exponential(1)
density.)
If X is a random variable having this density function f, then

$$P(a \le X \le b) = \int_a^b \lambda e^{-\lambda x}\,dx = (-e^{-\lambda b}) - (-e^{-\lambda a}) = e^{-\lambda a} - e^{-\lambda b}$$

for $0 \le a \le b < \infty$. The random variable X is said to have the Exponential($\lambda$)
distribution; we write this as X ~ Exponential($\lambda$). Note that some books and software
packages instead replace $\lambda$ by $1/\lambda$ in the definition of the Exponential($\lambda$) distribution,
so always check this when using another book or when using software.

An exponential distribution can often be used to model lifelengths. For example, a 
certain type of light bulb produced by a manufacturer might follow an Exponential($\lambda$)
distribution for an appropriate choice of $\lambda$. By this we mean that the lifelength X of a
randomly selected light bulb from those produced by this manufacturer has probability

$$P(X > x) = \int_x^{\infty} \lambda e^{-\lambda z}\,dz = e^{-\lambda x}$$

of lasting longer than x, in whatever units of time are being used. We will see in
Chapter 3 that, in a specific application, the value $1/\lambda$ will correspond to the average
lifelength of the light bulbs.
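The formulas for the Exponential($\lambda$) distribution translate directly into code. The sketch below uses the rate parameterization of this section; the function names and the value lam = 0.5 are hypothetical choices made only for illustration.

```python
from math import exp

def exponential_density(x, lam):
    """Density of Exponential(lam) in the rate form: lam * e^{-lam x} for x >= 0."""
    return lam * exp(-lam * x) if x >= 0 else 0.0

def exponential_prob(a, b, lam):
    """P(a <= X <= b) = e^{-lam a} - e^{-lam b}, for 0 <= a <= b."""
    return exp(-lam * a) - exp(-lam * b)

def exponential_survival(x, lam):
    """P(X > x) = e^{-lam x}, for x >= 0."""
    return exp(-lam * x)

lam = 0.5  # hypothetical rate, so that 1/lam = 2 time units
print(exponential_prob(1.0, 3.0, lam))
print(exponential_survival(2.0, lam))
```

The warning about parameterizations applies to software as well. For instance, Python's `random.expovariate` takes the rate $\lambda$, whereas NumPy's `numpy.random.exponential` takes the scale $1/\lambda$, so the same number passed to both describes different distributions.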

As another application of this distribution, consider a situation involving a server, 
e.g., a cashier at a fast-food restaurant, an automatic bank teller machine, a telephone 
exchange, etc. Units arrive for service in a random fashion and form a queue when the 
server is busy. It is often the case that the number of arrivals at the server, for some 
specific unit of time t, can be modeled by a Poisson($\lambda t$) distribution. Now let $T_1$ be the
time until the first arrival. Then we have

$$P(T_1 > t) = P(\text{no arrivals in } (0, t]) = \frac{(\lambda t)^0}{0!} e^{-\lambda t} = e^{-\lambda t},$$

and $T_1$ has density given by

$$f(t) = -\frac{d}{dt} \int_t^{\infty} f(z)\,dz = -\frac{d}{dt} P(T_1 > t) = \lambda e^{-\lambda t}.$$

So $T_1 \sim$ Exponential($\lambda$). ■
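The two steps of this argument can be mirrored numerically: first obtain P(T_1 > t) as the Poisson probability of zero arrivals, then differentiate it. In the sketch below the derivative is estimated by a central difference, which is an assumption of this illustration rather than anything used in the text; the rate 2.0 and time 0.7 are hypothetical values.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """P(N = k) for N ~ Poisson(mean)."""
    return mean**k * exp(-mean) / factorial(k)

def first_arrival_survival(t, lam):
    """P(T_1 > t) = P(no arrivals in (0, t]), with arrivals in (0, t] ~ Poisson(lam * t)."""
    return poisson_pmf(0, lam * t)

def first_arrival_density(t, lam, h=1e-6):
    """Central-difference estimate of -d/dt P(T_1 > t)."""
    upper = first_arrival_survival(t + h, lam)
    lower = first_arrival_survival(t - h, lam)
    return -(upper - lower) / (2 * h)

lam, t = 2.0, 0.7  # hypothetical rate and time
print(first_arrival_survival(t, lam), exp(-lam * t))
print(first_arrival_density(t, lam), lam * exp(-lam * t))
```

Each printed pair should agree closely, matching the conclusion that the first arrival time has the Exponential($\lambda$) density.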


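EXAMPLE 2.4.6 The Gamma($\alpha, \lambda$) Distribution
The gamma function is defined by

$$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t}\,dt, \qquad \alpha > 0.$$

It turns out (see Problem 2.4.15) that

$$\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha) \qquad (2.4.7)$$

and that if n is a positive integer, then $\Gamma(n) = (n - 1)!$, while $\Gamma(1/2) = \sqrt{\pi}$.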




We can use the gamma function to define the density of the Gamma($\alpha, \lambda$) distribution,
as follows. Let $\alpha > 0$ and $\lambda > 0$, and define a function f by

$$f(x) = \frac{\lambda^{\alpha} x^{\alpha - 1}}{\Gamma(\alpha)}\, e^{-\lambda x} \qquad (2.4.8)$$

when $x > 0$, with $f(x) = 0$ for $x \le 0$. Then clearly $f \ge 0$. Furthermore, it is
not hard to verify (see Problem 2.4.17) that $\int_0^{\infty} f(x)\,dx = 1$. Hence, f is a density
function.

A random variable X having density function f given by (2.4.8) is said to have the
Gamma($\alpha, \lambda$) distribution; we write this as X ~ Gamma($\alpha, \lambda$). Note that some books
and software packages instead replace $\lambda$ by $1/\lambda$ in the definition of the Gamma($\alpha, \lambda$)
distribution, so always check this when using another book or when using software.

The case $\alpha = 1$ corresponds (because $\Gamma(1) = 0! = 1$) to the Exponential($\lambda$) distribution:
Gamma($1, \lambda$) = Exponential($\lambda$). In Figure 2.4.2, we have plotted several
Gamma($\alpha, \lambda$) density functions.




Figure 2.4.2: Graph of an Exponential(1) (solid line), a Gamma(2, 1) (dashed line), and a 
Gamma(3, 1) (dotted line) density. 


A gamma distribution can also be used to model lifelengths. As Figure 2.4.2 shows,
the gamma family gives a much greater variety of shapes to choose from than does the
exponential family. ■
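Several facts stated in this example can be checked with a few lines of code. The sketch below relies on the standard library function `math.gamma`; the name `gamma_density`, the midpoint-rule integrator, and the truncation of the integral at 60 are illustrative assumptions, not part of the text.

```python
from math import exp, factorial, gamma, pi, sqrt

def gamma_density(x, alpha, lam):
    """Density (2.4.8) of Gamma(alpha, lam), with lam entering as a rate."""
    if x <= 0:
        return 0.0
    return lam**alpha * x**(alpha - 1) * exp(-lam * x) / gamma(alpha)

def midpoint_integral(g, a, b, steps=100_000):
    """Midpoint-rule approximation to the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

print(gamma(5), factorial(4))                               # Gamma(n) = (n - 1)!
print(gamma(0.5), sqrt(pi))                                 # Gamma(1/2) = sqrt(pi)
print(gamma_density(1.3, 1, 2.0), 2.0 * exp(-2.0 * 1.3))    # Gamma(1, lam) = Exponential(lam)
# Integral of the Gamma(3, 1) density over [0, 60]; the tail beyond 60 is negligible.
print(midpoint_integral(lambda x: gamma_density(x, 3, 1.0), 0.0, 60.0))
```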


We now define a function $\phi : R^1 \to R^1$ by

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \qquad (2.4.9)$$

This function $\phi$ is the famous “bell-shaped curve” because its graph is in the shape of
a bell, as shown in Figure 2.4.3.




Figure 2.4.3: Plot of the function $\phi$ in (2.4.9).


We have the following result for $\phi$.


Theorem 2.4.2 The function $\phi$ given by (2.4.9) is a density function.


PROOF See Section 2.11 for the proof of this result. ■


This leads to the following important distributions. 


EXAMPLE 2.4.7 The N(0, 1) Distribution 
Let X be a random variable having the density function $\phi$ given by (2.4.9). This means
that for $-\infty < a \le b < \infty$,

$$P(a \le X \le b) = \int_a^b \phi(x)\,dx = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,dx.$$

The random variable X is said to have the N(0, 1) distribution (or the standard normal
distribution); we write this as X ~ N(0, 1). ■


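EXAMPLE 2.4.8 The $N(\mu, \sigma^2)$ Distribution
Let $\mu \in R^1$, and let $\sigma > 0$. Let f be the function defined by

$$f(x) = \frac{1}{\sigma}\,\phi\!\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2}.$$

(If $\mu = 0$ and $\sigma = 1$, then this corresponds with the previous example.) Clearly,
$f \ge 0$. Also, letting $y = (x - \mu)/\sigma$, we have

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{\infty} \frac{1}{\sigma}\,\phi\!\left(\frac{x - \mu}{\sigma}\right) dx = \int_{-\infty}^{\infty} \frac{1}{\sigma}\,\phi(y)\,\sigma\,dy = \int_{-\infty}^{\infty} \phi(y)\,dy = 1.$$

Hence, f is a density function.
Let X be a random variable having this density function f. The random variable
X is said to have the $N(\mu, \sigma^2)$ distribution; we write this as $X \sim N(\mu, \sigma^2)$. In Figure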




2.4.4, we have plotted the N(0, 1) and the N(1, 1) densities. Note that changes in $\mu$
simply shift the density without changing its shape. In Figure 2.4.5, we have plotted
the N(0, 1) and the N(0, 4) densities. Note that both densities are centered on 0, but
the N(0, 4) density is much more spread out. The value of $\sigma^2$ controls the amount of
spread.


Figure 2.4.5: Graph of an N (0, 4) density (solid line) and an N (0, 1) density (dashed line). 


The $N(\mu, \sigma^2)$ distribution, for some choice of $\mu$ and $\sigma^2$, arises quite often in
applications. Part of the reason for this is an important result known as the central limit
theorem, which we will discuss in Section 4.4. In particular, this result leads to using
a normal distribution to approximate other distributions, just as we used the Poisson
distribution to approximate the binomial distribution in Example 2.3.6.

In a large human population, it is not uncommon for various body measurements to 
be normally distributed (at least to a reasonable degree of approximation). For example, 
let us suppose that heights (measured in feet) of students at a particular university are
distributed $N(\mu, \sigma^2)$ for some choice of $\mu$ and $\sigma^2$. Then the probability that a randomly
selected student has height between a and b feet, with a < b, is given by

$$\int_a^b \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2}\,dx.$$

In Section 2.5, we will discuss how to evaluate such an integral. Later in this text, we
will discuss how to select an appropriate value for $\mu$ and $\sigma^2$ and to assess whether or
not any normal distribution is appropriate to model the distribution of a variable defined
on a particular population. ■
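For readers who want a number before reaching Section 2.5, the integral can be approximated directly. The sketch below is only an illustration: the values mu = 5.5 and sigma = 0.3 are hypothetical, the midpoint rule is one simple numerical method, and the cross-check through the error function `math.erf` is an assumption of this sketch rather than the method the text develops.

```python
from math import erf, exp, pi, sqrt

def normal_density(x, mu, sigma):
    """Density of N(mu, sigma^2); sigma is the square root of the second parameter."""
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def midpoint_integral(g, a, b, steps=20_000):
    """Midpoint-rule approximation to the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

def normal_prob(a, b, mu, sigma):
    """P(a <= X <= b) computed through the error function (cross-check only)."""
    z = lambda x: (x - mu) / (sigma * sqrt(2))
    return 0.5 * (erf(z(b)) - erf(z(a)))

mu, sigma = 5.5, 0.3   # hypothetical values for heights in feet
a, b = 5.0, 6.0
print(midpoint_integral(lambda x: normal_density(x, mu, sigma), a, b))
print(normal_prob(a, b, mu, sigma))
```

Note that the second parameter of $N(\mu, \sigma^2)$ is the variance, so N(0, 4) corresponds to sigma = 2 in this code.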


Given an absolutely continuous random variable X, we will write its density as $f_X$,
or as f if no confusion arises. Absolutely continuous random variables will be used
extensively in later chapters of this book.


Remark 2.4.1 Finally, we note that density functions are not unique. Indeed, if f is a
density function and we change its value at a finite number of points, then the value of
$\int_a^b f(x)\,dx$ will remain unchanged. Hence, the changed function will also qualify as
a density corresponding to the same distribution. On the other hand, often a particular
“best” choice of density function is clear. For example, if the density function can be
chosen to be continuous, or even piecewise continuous, then this is preferred over some
other version of the density function.

To take a specific example, for the Uniform[0, 1] distribution, we could replace the
density f of (2.4.3) by

$$g(x) = \begin{cases} 1 & 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

or even by

$$h(x) = \begin{cases} 1 & 0 \le x < 3/4 \\ 17 & x = 3/4 \\ 1 & 3/4 < x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Either of these new densities would again define the Uniform[0, 1] distribution, because
we would have $\int_a^b f(x)\,dx = \int_a^b g(x)\,dx = \int_a^b h(x)\,dx$ for any a < b.

On the other hand, the densities f and g are both piecewise continuous and are
therefore natural choices for the density function, whereas h is an unnecessarily complicated
choice. Hence, when dealing with density functions, we shall always assume
that they are as continuous as possible, such as f and g, rather than having removable
discontinuities such as h. This will be particularly important when discussing likelihood
methods in Chapter 6. ■


Summary of Section 2.4 


• A random variable X is continuous if P(X = x) = 0 for all x, i.e., if none of its
probability comes from being equal to particular values.

• X is absolutely continuous if there exists a density function $f_X$ with
$P(a \le X \le b) = \int_a^b f_X(x)\,dx$ for all $a \le b$.

• Important absolutely continuous distributions include the uniform, exponential,
gamma, and normal.


EXERCISES 


2.4.1 Let U ~ Uniform[0, 1]. Compute each of the following. 

(a) P(U < 0)

(b) P(U = 1/2)

(c) P(U < -1/3)

(d) P(U < 2/3)

(e) P(U < 2/3)

(f) P(U < 1)

(g) P(U < 17)

2.4.2 Let W ~ Uniform[1, 4]. Compute each of the following. 

(a) P(W > 5)

(b) P(W > 2)

(c) $P(W^2 < 9)$ (Hint: If $W^2 < 9$, what must W be?)

(d) $P(W^2 < 2)$

2.4.3 Let Z ~ Exponential(4). Compute each of the following. 

(a) P(Z > 5) 

(b) P(Z > —5) 

(c) $P(Z^2 > 9)$

(d) $P(Z^2 - 17 > 9)$

2.4.4 Establish for which constants c the following functions are densities. 

(a) f(x) = cx on (0, 1) and 0 otherwise. 

(b) $f(x) = cx^n$ on (0, 1) and 0 otherwise, for n a nonnegative integer.

(c) $f(x) = cx^{1/2}$ on (0, 2) and 0 otherwise.

(d) $f(x) = c \sin x$ on $(0, \pi/2)$ and 0 otherwise.

2.4.5 Is the function defined by f(x) = x/3 for -1 < x < 2 and 0 otherwise, a
density? Why or why not? 

2.4.6 Let X ~ Exponential(3). Compute each of the following. 

(a) P(0 < X < 1)

(b) P(0 < X < 3)

(c) P(0 < X < 5)

(d) P(2 < X < 5)

(e) P(2 < X < 10)

(f) P(X > 2)

2.4.7 Let M > 0, and suppose $f(x) = cx^2$ for 0 < x < M, otherwise f(x) = 0. For
what value of c (depending on M) is f a density? 

2.4.8 Suppose X has density f and that f(x) > 2 for 0.3 < x < 0.4. Prove that 
P(0.3 < X < 0.4) > 0.2. 

2.4.9 Suppose X has density f and Y has density g. Suppose f(x) > g(x) for 1 <
x < 2. Prove that P(1 < X < 2) > P(1 < Y < 2).




2.4.10 Suppose X has density f and Y has density g. Is it possible that f(x) > g(x) 
for all x? Explain. 

2.4.11 Suppose X has density f and f(x) > f(y) whenever 0 <x <1 <y <2. 
Does it follow that P(0 < X < 1) > P(1 < X < 2)? Explain.

2.4.12 Suppose X has density f and f(x) > f(y) whenever 0 <x <1 <y <3. 
Does it follow that P(0 < X < 1) > P(1 < X < 3)? Explain. 

2.4.13 Suppose X ~ N (0, 1) and Y ~ N(1, 1). Prove that P(X < 3) > P(Y < 3). 


PROBLEMS 


2.4.14 Let Y ~ Exponential($\lambda$) for some $\lambda > 0$. Let y, h > 0. Prove that
P(Y - h > y | Y > h) = P(Y > y). That is, conditional on knowing that Y > h, the
random variable Y - h has the same distribution as Y did originally. This is called
the memoryless property of the exponential distributions; it says that they immediately 
“forget” their past behavior. 

2.4.15 Consider the gamma function $\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t}\,dt$, for $\alpha > 0$.

(a) Prove that $\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha)$. (Hint: Use integration by parts.)

(b) Prove that $\Gamma(1) = 1$.

(c) Use parts (a) and (b) to show that $\Gamma(n) = (n - 1)!$ if n is a positive integer.

2.4.16 Use the fact that $\Gamma(1/2) = \sqrt{\pi}$ to give an alternate proof that
$\int_{-\infty}^{\infty} \phi(x)\,dx = 1$ (as in Theorem 2.4.2). (Hint: Make the substitution $t = x^2/2$.)

2.4.17 Let f be the density of the Gamma($\alpha, \lambda$) distribution, as in (2.4.8). Prove that
$\int_0^{\infty} f(x)\,dx = 1$. (Hint: Let $t = \lambda x$.)

2.4.18 (Logistic distribution) Consider the function given by
$f(x) = e^{-x} (1 + e^{-x})^{-2}$ for $-\infty < x < \infty$. Prove that f is a density function.

2.4.19 (Weibull($\alpha$) distribution) Consider, for $\alpha > 0$ fixed, the function given by
$f(x) = \alpha x^{\alpha - 1} e^{-x^{\alpha}}$ for $0 < x < \infty$ and 0 otherwise. Prove that f is a density
function.

2.4.20 (Pareto($\alpha$) distribution) Consider, for $\alpha > 0$ fixed, the function given by
$f(x) = \alpha (1 + x)^{-\alpha - 1}$ for $0 < x < \infty$ and 0 otherwise. Prove that f is a density function.

2.4.21 (Cauchy distribution) Consider the function given by

$$f(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}$$

for $-\infty < x < \infty$. Prove that f is a density function. (Hint: Recall the derivative of
arctan(x).)

2.4.22 (Laplace distribution) Consider the function given by $f(x) = e^{-|x|}/2$ for
$-\infty < x < \infty$ and 0 otherwise. Prove that f is a density function.

2.4.23 (Extreme value distribution) Consider the function given by
$f(x) = e^{-x} \exp\{-e^{-x}\}$ for $-\infty < x < \infty$ and 0 otherwise. Prove that f is a density function.

2.4.24 (Beta(a, b) distribution) The beta function is the function $B : (0, \infty)^2 \to R^1$
given by

$$B(a, b) = \int_0^1 x^{a-1} (1 - x)^{b-1}\,dx.$$

It can be proved (see Challenge 2.4.25) that

$$B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}. \qquad (2.4.10)$$

(a) Prove that the function f given by $f(x) = B^{-1}(a, b)\, x^{a-1} (1 - x)^{b-1}$, for 0 <
x < 1 and 0 otherwise, is a density function.


(b) Determine and plot the density when a = 1, b = 1. Can you name this distribution? 
(c) Determine and plot the density when a = 2, b = 1. 
(d) Determine and plot the density when a = 1, b = 2. 
(e) Determine and plot the density when a = 2, b =2. 


CHALLENGES 


2.4.25 Prove (2.4.10). (Hint: Use $\Gamma(a)\,\Gamma(b) = \int_0^{\infty} \int_0^{\infty} x^{a-1} y^{b-1} e^{-x-y}\,dx\,dy$ and
make the change of variable $u = x + y$, $v = x/u$.)


DISCUSSION TOPICS 


2.4.26 Suppose X ~ N (0, 1) and Y ~ N (0, 4). Which do you think is larger, P(X > 
2) or P(Y > 2)? Why? (Hint: Look at Figure 2.4.5.) 


2.5 | Cumulative Distribution Functions 


If X is a random variable, then its distribution consists of the values of $P(X \in B)$ for
all subsets B of the real numbers. However, there are certain special subsets B that are
convenient to work with. Specifically, if $B = (-\infty, x]$ for some real number x, then
$P(X \in B) = P(X \le x)$. It turns out (see Theorem 2.5.1) that it is sufficient to keep
track of $P(X \le x)$ for all real numbers x.

This motivates the following definition. 


Definition 2.5.1 Given a random variable X, its cumulative distribution function
(or distribution function, or cdf for short) is the function $F_X : R^1 \to [0, 1]$, defined
by $F_X(x) = P(X \le x)$. (Where there is no confusion, we sometimes write F(x)
for $F_X(x)$.)


The reason for calling Fy the “distribution function” is that the full distribution 
of X can be determined directly from Fy. We demonstrate this for some events of 
particular importance. 

First, suppose that $B = (a, b]$ is a left-open interval. Using (1.3.3),

$$P(X \in B) = P(a < X \le b) = P(X \le b) - P(X \le a) = F_X(b) - F_X(a).$$


Now, suppose that $B = [a, b]$ is a closed interval. Using the continuity of probability (see Theorem 1.6.1), we have

$$P(X \in B) = P(a \le X \le b) = \lim_{n \to \infty} P(a - 1/n < X \le b) = \lim_{n \to \infty} \left( F_X(b) - F_X(a - 1/n) \right) = F_X(b) - \lim_{n \to \infty} F_X(a - 1/n).$$


We sometimes write $\lim_{n \to \infty} F_X(a - 1/n)$ as $F_X(a^-)$, so that $P(X \in [a, b]) = F_X(b) - F_X(a^-)$. In the special case where $a = b$, we have

$$P(X = a) = F_X(a) - F_X(a^-). \qquad (2.5.1)$$

Similarly, if $B = (a, b)$ is an open interval, then

$$P(X \in B) = P(a < X < b) = \lim_{n \to \infty} F_X(b - 1/n) - F_X(a) = F_X(b^-) - F_X(a).$$

If $B = [a, b)$ is a right-open interval, then

$$P(X \in B) = P(a \le X < b) = \lim_{n \to \infty} F_X(b - 1/n) - \lim_{n \to \infty} F_X(a - 1/n) = F_X(b^-) - F_X(a^-).$$


We conclude that we can determine $P(X \in B)$ from $F_X$ whenever $B$ is any kind of interval.
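The four interval formulas above can be checked numerically. The following is a minimal sketch, assuming a hypothetical discrete random variable taking the values 0, 1, 2 with probabilities 1/4, 1/2, 1/4 (this distribution and the function names are illustrative and do not come from the text). The left limit $F_X(x^-)$ is computed here directly as $P(X < x)$ rather than as a limit.

```python
from fractions import Fraction

# Hypothetical probability function, chosen only for illustration.
PMF = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x):
    """F_X(x) = P(X <= x)."""
    return sum((p for v, p in PMF.items() if v <= x), Fraction(0))

def cdf_left(x):
    """F_X(x^-) = P(X < x), the left limit of the cdf at x."""
    return sum((p for v, p in PMF.items() if v < x), Fraction(0))

def interval_prob(a, b, closed_left, closed_right):
    """P(X in interval from a to b), using only F_X and its left limits."""
    upper = cdf(b) if closed_right else cdf_left(b)
    lower = cdf_left(a) if closed_left else cdf(a)
    return upper - lower

print(interval_prob(1, 2, False, True))   # (a, b]
print(interval_prob(1, 2, True, True))    # [a, b]
print(interval_prob(1, 2, False, False))  # (a, b)
print(interval_prob(1, 2, True, False))   # [a, b)
```

Taking a = b with both endpoints closed recovers formula (2.5.1) for $P(X = a)$.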

Now, if $B$ is instead a union of intervals, then we can use additivity to again compute $P(X \in B)$ from $F_X$. For example, if

$$B = (a_1, b_1] \cup (a_2, b_2] \cup \cdots \cup (a_k, b_k],$$

with $a_1 < b_1 < a_2 < b_2 < \cdots < a_k < b_k$, then by additivity,

$$P(X \in B) = P(X \in (a_1, b_1]) + \cdots + P(X \in (a_k, b_k]) = F_X(b_1) - F_X(a_1) + \cdots + F_X(b_k) - F_X(a_k).$$

Hence, we can still compute $P(X \in B)$ solely from the values of $F_X(x)$.


Theorem 2.5.1 Let $X$ be any random variable, with cumulative distribution function $F_X$. Let $B$ be any subset of the real numbers. Then $P(X \in B)$ can be determined solely from the values of $F_X(x)$.

PROOF (Outline) It turns out that all relevant subsets $B$ can be obtained by applying limiting operations to unions of intervals. Hence, because $F_X$ determines $P(X \in B)$ when $B$ is a union of intervals, it follows that $F_X$ determines $P(X \in B)$ for all relevant subsets $B$. ■


2.5.1 | Properties of Distribution Functions 


In light of Theorem 2.5.1, we see that cumulative distribution functions $F_X$ are very useful. Thus, we note a few of their basic properties here.


Theorem 2.5.2 Let $F_X$ be the cumulative distribution function of a random variable $X$. Then

(a) $0 \le F_X(x) \le 1$ for all $x$,
(b) $F_X(x) \le F_X(y)$ whenever $x \le y$ (i.e., $F_X$ is increasing),
(c) $\lim_{x \to +\infty} F_X(x) = 1$,
(d) $\lim_{x \to -\infty} F_X(x) = 0$.




PROOF (a) Because $F_X(x) = P(X \le x)$ is a probability, it is between 0 and 1.

(b) Let $A = \{X \le x\}$ and $B = \{X \le y\}$. Then if $x \le y$, then $A \subseteq B$, so that $P(A) \le P(B)$. But $P(A) = F_X(x)$ and $P(B) = F_X(y)$, so the result follows.

(c) Let $A_n = \{X \le n\}$. Because $X$ must take on some value and hence $X \le n$ for sufficiently large $n$, we see that $\{A_n\}$ increases to $S$, i.e., $\{A_n\} \nearrow S$ (see Section 1.6). Hence, by continuity of $P$ (see Theorem 1.6.1), $\lim_{n \to \infty} P(A_n) = P(S) = 1$. But $P(A_n) = P(X \le n) = F_X(n)$, so the result follows.

(d) Let $B_n = \{X \le -n\}$. Because $X > -n$ for sufficiently large $n$, $\{B_n\}$ decreases to the empty set, i.e., $\{B_n\} \searrow \emptyset$. Hence, again by continuity of $P$, $\lim_{n \to \infty} P(B_n) = P(\emptyset) = 0$. But $P(B_n) = P(X \le -n) = F_X(-n)$, so the result follows. ■


If $F_X$ is a cumulative distribution function, then $F_X$ is also right continuous; see Problem 2.5.17. It turns out that if a function $F : R^1 \to R^1$ satisfies properties (a) through (d) and is right continuous, then there is a unique probability measure $P$ on $R^1$ such that $F$ is the cdf of $P$. We will not prove this result here.²


2.5.2 | Cdfs of Discrete Distributions 


We can compute the cumulative distribution function (cdf) $F_X$ of a discrete random variable from its probability function $p_X$, as follows.

Theorem 2.5.3 Let $X$ be a discrete random variable with probability function $p_X$. Then its cumulative distribution function $F_X$ satisfies $F_X(x) = \sum_{y \le x} p_X(y)$.

PROOF Let $x_1, x_2, \ldots$ be the possible values of $X$. Then $F_X(x) = P(X \le x) = \sum_{x_i \le x} P(X = x_i) = \sum_{y \le x} P(X = y) = \sum_{y \le x} p_X(y)$, as claimed. ■

Hence, if $X$ is a discrete random variable, then by Theorem 2.5.3, $F_X$ is piecewise constant, with a jump of size $p_X(x_i)$ at each value $x_i$. A plot of such a distribution function looks like that depicted in Figure 2.5.1.

We consider an example of a distribution function of a discrete random variable.


EXAMPLE 2.5.1
Consider rolling one fair six-sided die, so that $S = \{1, 2, 3, 4, 5, 6\}$, with $P(s) = 1/6$ for each $s \in S$. Let $X$ be the number showing on the die divided by 6, so that $X(s) = s/6$ for $s \in S$. What is $F_X(x)$? Since $X(s) \le x$ if and only if $s \le 6x$, we have that

$$F_X(x) = P(X \le x) = \sum_{s \in S,\, s \le 6x} P(s) = \sum_{s \in S,\, s \le 6x} \frac{1}{6} = \frac{1}{6} \left| \{ s \in S : s \le 6x \} \right|.$$


² For example, see page 67 of A First Look at Rigorous Probability Theory, Second Edition, by J. S. Rosenthal (World Scientific Publishing, Singapore, 2006).




That is, to compute $F_X(x)$, we count how many elements $s \in S$ satisfy $s \le 6x$ and multiply that number by 1/6. Therefore,

$$F_X(x) = \begin{cases} 0 & x < 1/6 \\ 1/6 & 1/6 \le x < 2/6 \\ 2/6 & 2/6 \le x < 3/6 \\ 3/6 & 3/6 \le x < 4/6 \\ 4/6 & 4/6 \le x < 5/6 \\ 5/6 & 5/6 \le x < 1 \\ 6/6 & 1 \le x. \end{cases}$$


In Figure 2.5.1, we present a graph of the function $F_X$ and note that this is a step function. Note (see Exercise 2.5.1) that the properties of Theorem 2.5.2 are indeed satisfied by the function $F_X$.
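The counting rule of Example 2.5.1 translates directly into code. The sketch below is illustrative only (the function name `die_cdf` is an assumption, not notation from the text); it uses exact fractions so that the jump points 1/6, 2/6, ... are handled without rounding error.

```python
from fractions import Fraction

def die_cdf(x):
    """cdf of X = (fair die roll) / 6: count the s in {1,...,6} with s <= 6x,
    then multiply that count by 1/6."""
    count = sum(1 for s in range(1, 7) if s <= 6 * x)
    return Fraction(count, 6)

# Evaluate just below, at, and between the jump points.
for x in [Fraction(1, 10), Fraction(1, 6), Fraction(7, 12), Fraction(1), Fraction(2)]:
    print(x, die_cdf(x))
```

Evaluating the function on an increasing grid of points is one informal way to see the step-function shape shown in Figure 2.5.1.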


Figure 2.5.1: Graph of the cdf $F_X$ in Example 2.5.1.


2.5.3 | Cdfs of Absolutely Continuous Distributions 


Once we know the density $f_X$ of $X$, then it is easy to compute the cumulative distribution function of $X$, as follows.

Theorem 2.5.4 Let $X$ be an absolutely continuous random variable, with density function $f_X$. Then the cumulative distribution function $F_X$ of $X$ satisfies

$$F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt$$

for $x \in R^1$.

PROOF This follows from (2.4.5), by setting $b = x$ and letting $a \to -\infty$. ■


From the fundamental theorem of calculus, we see that it is also possible to compute a density $f_X$ once we know the cumulative distribution function $F_X$.




Corollary 2.5.1 Let $X$ be an absolutely continuous random variable, with cumulative distribution function $F_X$. Let

$$f_X(x) = \frac{d}{dx} F_X(x) = F_X'(x).$$

Then $f_X$ is a density function for $X$.

We note that $F_X$ might not be differentiable everywhere, so that the function $f_X$ of the corollary might not be defined at certain isolated points. The density function may take any value at such points.

Consider again the $N(0, 1)$ distribution, with density $\phi$ given by (2.4.9). According to Theorem 2.5.4, the cumulative distribution function $F$ of this distribution is given by

$$F(x) = \int_{-\infty}^{x} \phi(t) \, dt = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt.$$

It turns out that it is provably impossible to evaluate this integral exactly in terms of elementary functions, except for certain specific values of $x$ (e.g., $x = -\infty$, $x = 0$, or $x = \infty$). Nevertheless, the cumulative distribution function of the $N(0, 1)$ distribution is so important that it is assigned a special symbol. Furthermore, this is tabulated in Table D.2 of Appendix D for certain values of $x$.


Definition 2.5.2 The symbol $\Phi$ stands for the cumulative distribution function of a standard normal distribution, defined by

$$\Phi(x) = \int_{-\infty}^{x} \phi(t) \, dt = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt. \qquad (2.5.2)$$


EXAMPLE 2.5.2 Normal Probability Calculations
Suppose that $X \sim N(0, 1)$, and we want to calculate

$$P(-0.63 \le X \le 2.0) = P(X \le 2.0) - P(X < -0.63).$$

Then $P(X \le 2) = \Phi(2)$, while $P(X < -0.63) = \Phi(-0.63)$. Unfortunately, $\Phi(2)$ and $\Phi(-0.63)$ cannot be computed exactly, but they can be approximated using a computer to numerically calculate the integral (2.5.2). Virtually all statistical software packages will provide such approximations, but many tabulations, such as Table D.2, are also available. Using this table, we obtain $\Phi(2) = 0.9772$, while $\Phi(-0.63) = 0.2643$. This implies that

$$P(-0.63 \le X \le 2.0) = \Phi(2.0) - \Phi(-0.63) = 0.9772 - 0.2643 = 0.7129.$$


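Now suppose that $X \sim N(\mu, \sigma^2)$, and we want to calculate $P(a \le X \le b)$. Letting $f$ denote the density of $X$ and following Example 2.4.8, we have

$$P(a \le X \le b) = \int_a^b f(x) \, dx = \int_a^b \frac{1}{\sigma} \phi\left( \frac{x - \mu}{\sigma} \right) dx.$$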




Then, again following Example 2.4.8, we make the substitution $y = (x - \mu)/\sigma$ in the above integral to obtain

$$P(a \le X \le b) = \int_{(a - \mu)/\sigma}^{(b - \mu)/\sigma} \phi(y) \, dy = \Phi\left( \frac{b - \mu}{\sigma} \right) - \Phi\left( \frac{a - \mu}{\sigma} \right).$$

Therefore, general normal probabilities can be computed using the function $\Phi$.

Suppose now that $a = -0.63$, $b = 2.0$, $\mu = 1.3$, and $\sigma^2 = 4$. We obtain

$$P(-0.63 \le X \le 2.0) = \Phi\left( \frac{2.0 - 1.3}{2} \right) - \Phi\left( \frac{-0.63 - 1.3}{2} \right) = \Phi(0.35) - \Phi(-0.965) = 0.6368 - 0.16725 = 0.46955$$

because, using Table D.2, $\Phi(0.35) = 0.6368$. We approximate $\Phi(-0.965)$ by the linear interpolation between the values $\Phi(-0.96) = 0.1685$, $\Phi(-0.97) = 0.1660$, given by

$$\Phi(-0.965) \approx \Phi(-0.96) + \frac{\Phi(-0.97) - \Phi(-0.96)}{-0.97 - (-0.96)} \left( -0.965 - (-0.96) \right) = 0.1685 + \frac{0.1660 - 0.1685}{-0.97 - (-0.96)} \left( -0.965 - (-0.96) \right) = 0.16725.$$ ■
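As a software alternative to Table D.2, the calculations of Example 2.5.2 can be sketched with Python's standard library. This sketch relies on the identity $\Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$, which is not stated in the text; the function names are illustrative assumptions.

```python
import math

def std_normal_cdf(x):
    """Phi(x), the N(0, 1) cdf, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_interval_prob(a, b, mu=0.0, sigma=1.0):
    """P(a <= X <= b) for X ~ N(mu, sigma^2), by standardizing the endpoints."""
    return std_normal_cdf((b - mu) / sigma) - std_normal_cdf((a - mu) / sigma)

# Standard normal case, then the N(1.3, 4) case (so sigma = 2).
p_standard = normal_interval_prob(-0.63, 2.0)
p_general = normal_interval_prob(-0.63, 2.0, mu=1.3, sigma=2.0)
print(round(p_standard, 4), round(p_general, 4))
```

The results should agree with the table-based values 0.7129 and 0.46955 to about three decimal places; small differences come from table rounding and the linear interpolation.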

EXAMPLE 2.5.3
Let $X$ be a random variable with cumulative distribution function given by

$$F_X(x) = \begin{cases} 0 & x < 2 \\ (x - 2)^4/16 & 2 \le x < 4 \\ 1 & 4 \le x. \end{cases}$$

In Figure 2.5.2, we present a graph of $F_X$.

Figure 2.5.2: Graph of the cdf $F_X$ in Example 2.5.3.


Suppose for this random variable $X$ we want to compute $P(X \le 3)$, $P(X < 3)$, $P(X > 2.5)$, and $P(1.2 < X \le 3.4)$. We can compute all these probabilities directly from $F_X$. We have that

$$P(X \le 3) = F_X(3) = (3 - 2)^4/16 = 1/16,$$
$$P(X < 3) = F_X(3^-) = \lim_{n \to \infty} (3 - (1/n) - 2)^4/16 = 1/16,$$
$$P(X > 2.5) = 1 - P(X \le 2.5) = 1 - F_X(2.5) = 1 - (2.5 - 2)^4/16 = 1 - 0.0625/16 = 0.996,$$
$$P(1.2 < X \le 3.4) = F_X(3.4) - F_X(1.2) = (3.4 - 2)^4/16 - 0 = 0.2401.$$ ■
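The computations of Example 2.5.3 can be reproduced with a few lines of code. This is a sketch only: the function name `cdf` is an assumption, and the left limit $F_X(3^-)$ is approximated by evaluating at $3 - 1/n$ for increasingly large $n$.

```python
def cdf(x):
    """The cdf of Example 2.5.3."""
    if x < 2:
        return 0.0
    if x < 4:
        return (x - 2) ** 4 / 16
    return 1.0

p_le_3 = cdf(3)                                  # P(X <= 3)
p_lt_3_approx = [cdf(3 - 1 / n) for n in (10, 1000, 100000)]  # tends to P(X < 3)
p_gt_2_5 = 1 - cdf(2.5)                          # P(X > 2.5)
p_interval = cdf(3.4) - cdf(1.2)                 # P(1.2 < X <= 3.4)

print(p_le_3, p_lt_3_approx, p_gt_2_5, p_interval)
```

Because this cdf is continuous, the approximations to $F_X(3^-)$ approach $F_X(3)$ itself, which is why $P(X < 3)$ and $P(X \le 3)$ agree here.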


2.5.4 | Mixture Distributions 


Suppose now that $F_1, F_2, \ldots, F_k$ are cumulative distribution functions, corresponding to various distributions. Also let $p_1, p_2, \ldots, p_k$ be positive real numbers with $\sum_{i=1}^{k} p_i = 1$ (so these values form a probability distribution). Then we can define a new function $G$ by

$$G(x) = p_1 F_1(x) + p_2 F_2(x) + \cdots + p_k F_k(x). \qquad (2.5.3)$$

It is easily verified (see Exercise 2.5.6) that the function $G$ given by (2.5.3) will satisfy properties (a) through (d) of Theorem 2.5.2 and is right continuous. Hence, $G$ is also a cdf.

The distribution whose cdf is given by (2.5.3) is called a mixture distribution because it mixes the various distributions with cdfs $F_1, \ldots, F_k$ according to the probability distribution given by the $p_1, p_2, \ldots, p_k$.

To see how a mixture distribution arises in applications, consider a two-stage system, as discussed in Section 1.5.1. Let $Z$ be a random variable describing the outcome of the first stage and such that $P(Z = i) = p_i$ for $i = 1, 2, \ldots, k$. Suppose that for the second stage, we observe a random variable $Y$ where the distribution of $Y$ depends on the outcome of the first stage, so that $Y$ has cdf $F_i$ when $Z = i$. In effect, $F_i$ is the conditional distribution of $Y$, given that $Z = i$ (see Section 2.8). Then, by the law of total probability (see Theorem 1.5.1), the distribution function of $Y$ is given by

$$P(Y \le y) = \sum_{i=1}^{k} P(Y \le y \mid Z = i) \, P(Z = i) = \sum_{i=1}^{k} p_i F_i(y) = G(y).$$

Therefore, the distribution function of $Y$ is given by a mixture of the $F_i$.
Consider the following example of this.


EXAMPLE 2.5.4 

Suppose we have two bowls containing chips. Bowl #1 contains one chip labelled 0, two chips labelled 3, and one chip labelled 5. Bowl #2 contains one chip labelled 2, one chip labelled 4, and one chip labelled 5. Now let $X_i$ be the random variable corresponding to randomly drawing a chip from bowl #$i$. Therefore, $P(X_1 = 0) = 1/4$, $P(X_1 = 3) = 1/2$, and $P(X_1 = 5) = 1/4$, while $P(X_2 = 2) = P(X_2 = 4) = P(X_2 = 5) = 1/3$. Then $X_1$ has distribution function given by

$$F_1(x) = \begin{cases} 0 & x < 0 \\ 1/4 & 0 \le x < 3 \\ 3/4 & 3 \le x < 5 \\ 1 & x \ge 5 \end{cases}$$

and $X_2$ has distribution function given by

$$F_2(x) = \begin{cases} 0 & x < 2 \\ 1/3 & 2 \le x < 4 \\ 2/3 & 4 \le x < 5 \\ 1 & x \ge 5. \end{cases}$$


Now suppose that we choose a bowl by randomly selecting a card from a deck of five cards where one card is labelled 1 and four cards are labelled 2. Let $Z$ denote the value on the card obtained, so that $P(Z = 1) = 1/5$ and $P(Z = 2) = 4/5$. Then, having obtained the value $Z = i$, we observe $Y$ by randomly drawing a chip from bowl #$i$. We see immediately that the cdf of $Y$ is given by

$$G(x) = (1/5) F_1(x) + (4/5) F_2(x),$$

and this is a mixture of the cdfs $F_1$ and $F_2$. ■
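The mixture cdf of Example 2.5.4 can be evaluated exactly. The sketch below is illustrative (the helper `step_cdf` and the dictionary representation of the bowls are assumptions made for this example); it builds each bowl's cdf from its probability function and then forms the weighted combination.

```python
from fractions import Fraction

def step_cdf(pmf):
    """Return the cdf x -> sum of pmf[v] over values v <= x."""
    def F(x):
        return sum((p for v, p in pmf.items() if v <= x), Fraction(0))
    return F

# Bowl #1: chips 0, 3, 3, 5.  Bowl #2: chips 2, 4, 5.
F1 = step_cdf({0: Fraction(1, 4), 3: Fraction(1, 2), 5: Fraction(1, 4)})
F2 = step_cdf({2: Fraction(1, 3), 4: Fraction(1, 3), 5: Fraction(1, 3)})

def G(x):
    """Mixture cdf with P(Z = 1) = 1/5 and P(Z = 2) = 4/5."""
    return Fraction(1, 5) * F1(x) + Fraction(4, 5) * F2(x)

for x in [-1, 0, 2, 3, 4, 5]:
    print(x, G(x))
```

The jumps of G occur at every chip label that appears in either bowl, with sizes given by the mixture weights times the corresponding chip probabilities.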


As the following examples illustrate, it is also possible to have infinite mixtures of 
distributions. 


EXAMPLE 2.5.5 Location and Scale Mixtures
Suppose $F$ is some cumulative distribution function. Then for any real number $y$, the function $F_y$ defined by $F_y(x) = F(x - y)$ is also a cumulative distribution function. In fact, $F_y$ is just a “shifted” version of $F$. An example of this is depicted in Figure 2.5.3.

Figure 2.5.3: Plot of the distribution functions $F$ (solid line) and $F_2$ (dashed line) in Example 2.5.5, where $F(x) = e^x/(e^x + 1)$ for $x \in R^1$.

If $p_i \ge 0$ with $\sum_i p_i = 1$ (so the $p_i$ form a probability distribution), and $y_1, y_2, \ldots$ are real numbers, then we can define a discrete location mixture by

$$H(x) = \sum_i p_i F_{y_i}(x) = \sum_i p_i F(x - y_i).$$




Indeed, the shift $F_y(x) = F(x - y)$ itself corresponds to a special case of a discrete location mixture, with $p_1 = 1$ and $y_1 = y$.

Furthermore, if $g$ is some nonnegative function with $\int_{-\infty}^{\infty} g(t) \, dt = 1$ (so $g$ is a density function), then we can define

$$H(x) = \int_{-\infty}^{\infty} F_y(x) \, g(y) \, dy = \int_{-\infty}^{\infty} F(x - y) \, g(y) \, dy.$$

Then it is not hard to see that $H$ is also a cumulative distribution function, one that is called a continuous location mixture of $F$. The idea is that $H$ corresponds to a mixture of different shifted distributions $F_y$, with the density $g$ giving the distribution of the mixing coefficient $y$.

We can also define a discrete scale mixture by

$$K(x) = \sum_i p_i F(x / y_i)$$

whenever $y_i > 0$, $p_i \ge 0$, and $\sum_i p_i = 1$. Similarly, if $\int_0^{\infty} g(t) \, dt = 1$, then we can write

$$K(x) = \int_0^{\infty} F(x / y) \, g(y) \, dy.$$

Then $K$ is also a cumulative distribution function, called a continuous scale mixture of $F$. ■
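The discrete location and scale mixtures of Example 2.5.5 can be sketched as follows, taking $F$ to be the logistic cdf $F(x) = e^x/(e^x + 1)$ from Figure 2.5.3. The weights, shifts, and scales used here are hypothetical values chosen only for illustration.

```python
import math

def logistic_cdf(x):
    """F(x) = e^x / (e^x + 1), written in the equivalent form 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def location_mixture(x, weights, shifts):
    """H(x) = sum_i p_i F(x - y_i)."""
    return sum(p * logistic_cdf(x - y) for p, y in zip(weights, shifts))

def scale_mixture(x, weights, scales):
    """K(x) = sum_i p_i F(x / y_i), with every y_i > 0."""
    return sum(p * logistic_cdf(x / y) for p, y in zip(weights, scales))

weights = [0.3, 0.7]   # hypothetical mixing probabilities
shifts = [-1.0, 2.0]   # hypothetical location parameters y_i
scales = [1.0, 3.0]    # hypothetical scale parameters y_i

for x in [-5.0, 0.0, 5.0]:
    print(x, location_mixture(x, weights, shifts), scale_mixture(x, weights, scales))
```

With a single weight equal to 1, `location_mixture` reduces to the shifted cdf $F(x - y)$, matching the special case noted in the example.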


You might wonder at this point whether a mixture distribution is discrete or con- 
tinuous. The answer depends on the distributions being mixed and the mixing distrib- 
ution. For example, discrete location mixtures of discrete distributions are discrete and 
discrete location mixtures of continuous distributions are continuous. 

There is nothing restricting us, however, to mixing only discrete distributions or 
only continuous distributions. Other kinds of distribution are considered in the follow- 
ing section. 


2.5.5 | Distributions Neither Discrete Nor Continuous (Advanced) 


There are some distributions that are neither discrete nor continuous, as the following 
example shows. 


EXAMPLE 2.5.6 

Suppose that $X_1 \sim$ Poisson(3) is discrete with cdf $F_1$, while $X_2 \sim N(0, 1)$ is continuous with cdf $F_2$, and $Y$ has the mixture distribution given by $F_Y(y) = (1/5) F_1(y) + (4/5) F_2(y)$. Using (2.5.1), we have

$$P(Y = y) = F_Y(y) - F_Y(y^-)$$
$$= (1/5) F_1(y) + (4/5) F_2(y) - (1/5) F_1(y^-) - (4/5) F_2(y^-)$$
$$= (1/5) \left( F_1(y) - F_1(y^-) \right) + (4/5) \left( F_2(y) - F_2(y^-) \right)$$
$$= \frac{1}{5} P(X_1 = y) + \frac{4}{5} P(X_2 = y).$$

Therefore,

$$P(Y = y) = \begin{cases} \dfrac{1}{5} \dfrac{3^y}{y!} e^{-3} & y \text{ a nonnegative integer} \\ 0 & \text{otherwise.} \end{cases}$$

Because $P(Y = y) > 0$ for nonnegative integers $y$, the random variable $Y$ is not continuous. On the other hand, we have

$$\sum_y P(Y = y) = \sum_{y=0}^{\infty} \frac{1}{5} \frac{3^y}{y!} e^{-3} = \frac{1}{5} < 1.$$

Hence, $Y$ is not discrete either.
In fact, $Y$ is neither discrete nor continuous. Rather, $Y$ is a mixture of a discrete and a continuous distribution. ■
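The point masses of $Y$ in Example 2.5.6 can be summed numerically. This is a minimal sketch (the function name `point_mass` and the truncation of the infinite sum at 60 terms are assumptions for illustration); it shows that the point masses add up to about 1/5 rather than 1.

```python
import math

def point_mass(y):
    """P(Y = y) for the mixture (1/5) Poisson(3) + (4/5) N(0, 1)."""
    if isinstance(y, int) and y >= 0:
        return 0.2 * (3 ** y) * math.exp(-3) / math.factorial(y)
    return 0.0  # the normal component puts no mass on any single point

# Truncate the infinite sum; the omitted Poisson(3) tail is negligible.
total_point_mass = sum(point_mass(y) for y in range(60))
print(total_point_mass)
```

The remaining probability, about 4/5, is spread continuously by the normal component, which is why $Y$ is neither discrete nor continuous.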


For the most part in this book, we shall treat discrete and continuous distributions separately. However, it is important to keep in mind that actual distributions may be neither discrete nor continuous but rather a mixture of the two.³ In most applications, however, the distributions we deal with are either continuous or discrete.

Recall that a continuous distribution need not be absolutely continuous, i.e., have a 
density. Hence, a distribution that is a mixture of a discrete and a continuous distribu- 
tion might not be a mixture of a discrete and an absolutely continuous distribution. 


Summary of Section 2.5 


• The cumulative distribution function (cdf) of $X$ is $F_X(x) = P(X \le x)$.

• All probabilities associated with $X$ can be determined from $F_X$.

• As $x$ increases from $-\infty$ to $\infty$, $F_X(x)$ increases from 0 to 1.

• If $X$ is discrete, then $F_X(x) = \sum_{y \le x} P(X = y)$.

• If $X$ is absolutely continuous, then $F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt$, and $f_X(x) = F_X'(x)$.

• We write $\Phi(x)$ for the cdf of the standard normal distribution evaluated at $x$.

• A mixture distribution has a cdf that is a linear combination of other cdfs. Two special cases are location and scale mixtures.

• Some mixture distributions are neither discrete nor continuous.


EXERCISES 


2.5.1 Verify explicitly that properties (a) through (d) of Theorem 2.5.2 are indeed sat- 
isfied by the function Fy in Example 2.5.1. 

2.5.2 Consider rolling one fair six-sided die, so that $S = \{1, 2, 3, 4, 5, 6\}$, and $P(s) = 1/6$ for all $s \in S$. Let $X$ be the number showing on the die, so that $X(s) = s$ for $s \in S$. Let $Y = X^2$. Compute the cumulative distribution function $F_Y(y) = P(Y \le y)$, for all $y \in R^1$. Verify explicitly that properties (a) through (d) of Theorem 2.5.2 are satisfied by this function $F_Y$.


³ In fact, there exist probability distributions that cannot be expressed even as a mixture of a discrete and a continuous distribution, but these need not concern us here.




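2.5.3 For each of the following functions $F$, determine whether or not $F$ is a valid cumulative distribution function, i.e., whether or not $F$ satisfies properties (a) through (d) of Theorem 2.5.2.

(a) $F(x) = x$ for all $x \in R^1$

(b) $F(x) = \begin{cases} 0 & x < 0 \\ x & 0 \le x \le 1 \\ 1 & x > 1 \end{cases}$

(c) $F(x) = \begin{cases} 0 & x < 0 \\ x^2 & 0 \le x \le 1 \\ 1 & x > 1 \end{cases}$

(d) $F(x) = \begin{cases} 0 & x < 0 \\ x^2 & 0 \le x \le 3 \\ 1 & x > 3 \end{cases}$

(e) $F(x) = \begin{cases} 0 & x < 0 \\ x^2/9 & 0 \le x \le 3 \\ 1 & x > 3 \end{cases}$

(f) $F(x) = \begin{cases} 0 & x < 1 \\ x^2/9 & 1 \le x \le 3 \\ 1 & x > 3 \end{cases}$

(g) $F(x) = \begin{cases} 0 & x < -1 \\ x^2/9 & -1 \le x \le 3 \\ 1 & x > 3 \end{cases}$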


2.5.4 Let $X \sim N(0, 1)$. Compute each of the following in terms of the function $\Phi$ of Definition 2.5.2 and use Table D.2 (or software) to evaluate these probabilities numerically.

(a) $P(X \le -5)$

(b) $P(-2 \le X \le 7)$

(c) $P(X \ge 3)$

2.5.5 Let $Y \sim N(-8, 4)$. Compute each of the following, in terms of the function $\Phi$ of Definition 2.5.2 and use Table D.2 (or software) to evaluate these probabilities numerically.

(a) $P(Y \le -5)$

(b) $P(-2 \le Y \le 7)$

(c) $P(Y \ge 3)$




2.5.6 Verify that the function G given by (2.5.3) satisfies properties (a) through (d) of 
Theorem 2.5.2. 

2.5.7 Suppose $F_X(x) = x^2$ for $0 \le x \le 1$. Compute each of the following.

(a) $P(X < 1/3)$

(b) $P(1/4 < X < 1/2)$

(c) $P(2/5 < X < 4/5)$

(d) $P(X < 0)$

(e) $P(X < 1)$

(f) $P(X < -1)$

(g) $P(X < 3)$

(h) $P(X = 3/7)$

2.5.8 Suppose Fy(y) = y? for 0 < y < 1/2, and Fy(y) = 1 — (1 — y) for 1/2 < 
y < 1. Compute each of the following. 

(a) P(1/3 < Y < 3/4) 

(b) P(Y = 1/3) 

(c) $P(Y = 1/2)$

2.5.9 Let $F(x) = x^2$ for $0 \le x \le 2$, with $F(x) = 0$ for $x < 0$ and $F(x) = 4$ for $x > 2$.

(a) Sketch a graph of F. 

(b) Is F a valid cumulative distribution function? Why or why not? 

2.5.10 Let F(x) =0 for x < 0, with F(x) =e for x > 0. 

(a) Sketch a graph of F. 

(b) Is F a valid cumulative distribution function? Why or why not? 

2.5.11 Let $F(x) = 0$ for $x < 0$, with $F(x) = 1 - e^{-x}$ for $x \ge 0$.

(a) Sketch a graph of F. 

(b) Is F a valid cumulative distribution function? Why or why not? 

2.5.12 Let X ~ Exponential(3). Compute the function Fy. 

2.5.13 Let $F(x) = 0$ for $x < 0$, with $F(x) = 1/3$ for $0 \le x < 2/5$, and $F(x) = 3/4$ for $2/5 \le x < 4/5$, and $F(x) = 1$ for $x \ge 4/5$.

(a) Sketch a graph of F. 

(b) Prove that F is a valid cumulative distribution function. 

(c) If X has cumulative distribution function equal to F, then compute P(X > 4/5) 
and P(—1 < X < 1/2) and P(X = 2/5) and P(X = 4/5). 

2.5.14 Let $G(x) = 0$ for $x < 0$, with $G(x) = 1 - e^{-x}$ for $x \ge 0$.

(a) Prove that $G$ is a valid cumulative distribution function.

(b) If $Y$ has cumulative distribution function equal to $G$, then compute $P(Y > 4)$ and $P(-1 < Y < 2)$ and $P(Y = 0)$.

2.5.15 Let $F$ and $G$ be as in the previous two exercises. Let $H(x) = (1/3) F(x) + (2/3) G(x)$. Suppose $Z$ has cumulative distribution function equal to $H$. Compute each of the following.

(a) P(Z > 4/5) 

(b) P(-1 < Z < 1/2) 

(c) P(Z =2/5) 

(d) P(Z = 4/5) 




(e) P(Z =0) 
(f) $P(Z = 1/2)$


PROBLEMS 


2.5.16 Let $F$ be a cumulative distribution function. Compute (with explanation) the value of $\lim_{n \to \infty} [F(2n) - F(n)]$.

2.5.17 Let $F$ be a cumulative distribution function. For $x \in R^1$, we could define $F(x^+)$ by $F(x^+) = \lim_{n \to \infty} F(x + 1/n)$. Prove that $F$ is right continuous, meaning that for each $x \in R^1$, we have $F(x^+) = F(x)$. (Hint: You will need to use continuity of $P$ (Theorem 1.6.1).)

2.5.18 Let $X$ be a random variable, with cumulative distribution function $F_X$. Prove that $P(X = a) = 0$ if and only if the function $F_X$ is continuous at $a$. (Hint: Use (2.5.1) and the previous problem.)


2.5.19 Let $\Phi$ be as in Definition 2.5.2. Derive a formula for $\Phi(-x)$ in terms of $\Phi(x)$. (Hint: Let $s = -t$ in (2.5.2), and do not forget Theorem 2.5.2.)


2.5.20 Determine the distribution function for the logistic distribution of Problem 2.4.18.

2.5.21 Determine the distribution function for the Weibull($\alpha$) distribution of Problem 2.4.19.

2.5.22 Determine the distribution function for the Pareto($\alpha$) distribution of Problem 2.4.20.
2.5.23 Determine the distribution function for the Cauchy distribution of Problem 
2.4.21. 

2.5.24 Determine the distribution function for the Laplace distribution of Problem 
2.4.22. 

2.5.25 Determine the distribution function for the extreme value distribution of Prob- 
lem 2.4.23. 

2.5.26 Determine the distribution function for the beta distributions of Problem 2.4.24 
for parts (b) through (e). 


DISCUSSION TOPICS 


2.5.27 Does it surprise you that all information about the distribution of a random 
variable X can be stored by a single function Fy? Why or why not? What other 
examples can you think of where lots of different information is stored by a single 
function? 


2.6 | One-Dimensional Change of Variable 


Let $X$ be a random variable with a known distribution. Suppose that $Y = h(X)$, where $h : R^1 \to R^1$ is some function. (Recall that this really means that $Y(s) = h(X(s))$, for all $s \in S$.) Then what is the distribution of $Y$?




2.6.1 | The Discrete Case 


If $X$ is a discrete random variable, this is quite straightforward. To compute the probability that $Y = y$, we need to compute the probability of the set consisting of all the $x$ values satisfying $h(x) = y$, namely, compute $P(X \in \{x : h(x) = y\})$. This is depicted graphically in Figure 2.6.1.


Figure 2.6.1: An example where the set of $x$ values that satisfy $h(x) = y$ consists of three points $x_1$, $x_2$, and $x_3$, so that $\{x : h(x) = y\} = \{x_1, x_2, x_3\}$.


We now establish the basic result. 


Theorem 2.6.1 Let $X$ be a discrete random variable, with probability function $p_X$. Let $Y = h(X)$, where $h : R^1 \to R^1$ is some function. Then $Y$ is also discrete, and its probability function $p_Y$ satisfies $p_Y(y) = \sum_{x \in h^{-1}\{y\}} p_X(x)$, where $h^{-1}\{y\}$ is the set of all real numbers $x$ with $h(x) = y$.

PROOF We compute that $p_Y(y) = P(h(X) = y) = \sum_{x \in h^{-1}\{y\}} P(X = x) = \sum_{x \in h^{-1}\{y\}} p_X(x)$, as claimed. ■


EXAMPLE 2.6.1 

Let $X$ be the number of heads when flipping three fair coins. Let $Y = 1$ if $X \ge 1$, with $Y = 0$ if $X = 0$. Then $Y = h(X)$, where $h(0) = 0$ and $h(1) = h(2) = h(3) = 1$. Hence, $h^{-1}\{0\} = \{0\}$, so $P(Y = 0) = P(X = 0) = 1/8$. On the other hand, $h^{-1}\{1\} = \{1, 2, 3\}$, so $P(Y = 1) = P(X = 1) + P(X = 2) + P(X = 3) = 3/8 + 3/8 + 1/8 = 7/8$. ■


EXAMPLE 2.6.2 

Let $X$ be the number showing on a fair six-sided die, so that $P(X = x) = 1/6$ for $x = 1, 2, 3, 4, 5$, and 6. Let $Y = X^2 - 3X + 2$. Then $Y = h(X)$, where $h(x) = x^2 - 3x + 2$. Note that $h(x) = 0$ if and only if $x = 1$ or $x = 2$. Hence, $h^{-1}\{0\} = \{1, 2\}$ and

$$P(Y = 0) = p_X(1) + p_X(2) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}.$$ ■

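Theorem 2.6.1 amounts to adding up the probabilities of all $x$ values that $h$ sends to the same $y$. The sketch below illustrates this on Examples 2.6.1 and 2.6.2; the function name `pushforward` and the dictionary representation of a probability function are assumptions made for illustration.

```python
from collections import defaultdict
from fractions import Fraction

def pushforward(pmf_x, h):
    """Probability function of Y = h(X): p_Y(y) = sum of p_X(x) over x with h(x) = y."""
    pmf_y = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for x, p in pmf_x.items():
        pmf_y[h(x)] += p
    return dict(pmf_y)

# Example 2.6.1: X = number of heads in three fair coin flips.
coins = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
pmf_indicator = pushforward(coins, lambda x: 1 if x >= 1 else 0)

# Example 2.6.2: X = fair die roll, Y = X^2 - 3X + 2.
die = {x: Fraction(1, 6) for x in range(1, 7)}
pmf_quadratic = pushforward(die, lambda x: x * x - 3 * x + 2)

print(pmf_indicator)
print(pmf_quadratic)
```

Note that $h$ need not be one-to-one: in the second example both $x = 1$ and $x = 2$ contribute to $P(Y = 0)$.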
2.6.2 | The Continuous Case 


If X is continuous and Y = h(X), then the situation is more complicated. Indeed, Y 
might not be continuous at all, as the following example shows. 




EXAMPLE 2.6.3 
Let $X$ have the uniform distribution on $[0, 1]$, i.e., $X \sim$ Uniform[0, 1], as in Example 2.4.2. Let $Y = h(X)$, where

$$h(x) = \begin{cases} 7 & x \le 3/4 \\ 5 & x > 3/4. \end{cases}$$

Here, $Y = 7$ if and only if $X \le 3/4$ (which happens with probability 3/4), whereas $Y = 5$ if and only if $X > 3/4$ (which happens with probability 1/4). Hence, $Y$ is discrete, with probability function $p_Y$ satisfying $p_Y(7) = 3/4$, $p_Y(5) = 1/4$, and $p_Y(y) = 0$ when $y \ne 5, 7$. ■


On the other hand, if $X$ is absolutely continuous, and the function $h$ is strictly increasing, then the situation is considerably simpler, as the following theorem shows.

Theorem 2.6.2 Let $X$ be an absolutely continuous random variable, with density function $f_X$. Let $Y = h(X)$, where $h : R^1 \to R^1$ is a function that is differentiable and strictly increasing. Then $Y$ is also absolutely continuous, and its density function $f_Y$ is given by

$$f_Y(y) = f_X(h^{-1}(y)) \, / \, |h'(h^{-1}(y))|, \qquad (2.6.1)$$

where $h'$ is the derivative of $h$, and where $h^{-1}(y)$ is the unique number $x$ such that $h(x) = y$.

PROOF See Section 2.11 for the proof of this result. ■


EXAMPLE 2.6.4 
Let $X \sim$ Uniform[0, 1], and let $Y = 3X$. What is the distribution of $Y$?

Here, $X$ has density $f_X$, given by $f_X(x) = 1$ if $0 \le x \le 1$, and $f_X(x) = 0$ otherwise. Also, $Y = h(X)$, where $h$ is defined by $h(x) = 3x$. Note that $h$ is strictly increasing because if $x < y$, then $3x < 3y$, i.e., $h(x) < h(y)$. Hence, we may apply Theorem 2.6.2.

We note first that $h'(x) = 3$ and that $h^{-1}(y) = y/3$. Then, according to Theorem 2.6.2, $Y$ is absolutely continuous with density

$$f_Y(y) = f_X(h^{-1}(y)) / |h'(h^{-1}(y))| = \frac{1}{3} f_X(y/3) = \begin{cases} 1/3 & 0 \le y/3 \le 1 \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 1/3 & 0 \le y \le 3 \\ 0 & \text{otherwise.} \end{cases}$$

By comparison with Example 2.4.3, we see that $Y \sim$ Uniform[0, 3], i.e., that $Y$ has the Uniform[$L$, $R$] distribution with $L = 0$ and $R = 3$. ■

EXAMPLE 2.6.5 
Let $X \sim N(0, 1)$, and let $Y = 2X + 5$. What is the distribution of $Y$?
Here, $X$ has density $f_X$, given by

$$f_X(x) = \phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.$$

Also, $Y = h(X)$, where $h$ is defined by $h(x) = 2x + 5$. Note that again, $h$ is strictly increasing because if $x < y$, then $2x + 5 < 2y + 5$, i.e., $h(x) < h(y)$. Hence, we may again apply Theorem 2.6.2.

We note first that $h'(x) = 2$ and that $h^{-1}(y) = (y - 5)/2$. Then, according to Theorem 2.6.2, $Y$ is absolutely continuous with density

$$f_Y(y) = f_X(h^{-1}(y)) / |h'(h^{-1}(y))| = f_X((y - 5)/2)/2 = \frac{1}{2\sqrt{2\pi}} e^{-(y - 5)^2/8}.$$

By comparison with Example 2.4.8, we see that $Y \sim N(5, 4)$, i.e., that $Y$ has the $N(\mu, \sigma^2)$ distribution with $\mu = 5$ and $\sigma^2 = 4$. ■
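Formula (2.6.1) can be written as a small higher-order function and checked against Examples 2.6.4 and 2.6.5. This is a sketch under stated assumptions: the caller supplies $h^{-1}$ and $h'$ explicitly, and all function names are illustrative rather than taken from the text.

```python
import math

def transformed_density(f_x, h_inverse, h_prime):
    """Density of Y = h(X) via (2.6.1): f_Y(y) = f_X(h^{-1}(y)) / |h'(h^{-1}(y))|."""
    def f_y(y):
        x = h_inverse(y)
        return f_x(x) / abs(h_prime(x))
    return f_y

def uniform01_density(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def std_normal_density(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def normal_density(y, mu, sigma):
    return math.exp(-((y - mu) / sigma) ** 2 / 2.0) / (sigma * math.sqrt(2.0 * math.pi))

# Example 2.6.4: Y = 3X with X ~ Uniform[0, 1].
f_y_uniform = transformed_density(uniform01_density, lambda y: y / 3.0, lambda x: 3.0)
# Example 2.6.5: Y = 2X + 5 with X ~ N(0, 1).
f_y_normal = transformed_density(std_normal_density, lambda y: (y - 5.0) / 2.0, lambda x: 2.0)

print(f_y_uniform(1.5), f_y_uniform(4.0))
print(f_y_normal(6.0), normal_density(6.0, 5.0, 2.0))
```

The last line compares the transformed density with the $N(5, 4)$ density directly; the two values should agree up to floating-point rounding.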


If instead the function $h$ is strictly decreasing, then a similar result holds.

Theorem 2.6.3 Let $X$ be an absolutely continuous random variable, with density function $f_X$. Let $Y = h(X)$, where $h : R^1 \to R^1$ is a function that is differentiable and strictly decreasing. Then $Y$ is also absolutely continuous, and its density function $f_Y$ may again be defined by (2.6.1).

PROOF See Section 2.11 for the proof of this result. ■


EXAMPLE 2.6.6 
Let $X \sim$ Uniform[0, 1], and let $Y = \ln(1/X)$. What is the distribution of $Y$?

Here, $X$ has density $f_X$, given by $f_X(x) = 1$ for $0 \le x \le 1$, and $f_X(x) = 0$ otherwise. Also, $Y = h(X)$, where $h$ is defined by $h(x) = \ln(1/x)$. Note that here, $h$ is strictly decreasing because if $x < y$, then $1/x > 1/y$, so $\ln(1/x) > \ln(1/y)$, i.e., $h(x) > h(y)$. Hence, we may apply Theorem 2.6.3.

We note first that $h'(x) = -1/x$ and that $h^{-1}(y) = e^{-y}$. Then, by Theorem 2.6.3, $Y$ is absolutely continuous with density

$$f_Y(y) = f_X(h^{-1}(y)) / |h'(h^{-1}(y))| = e^{-y} f_X(e^{-y}) = \begin{cases} e^{-y} & 0 \le e^{-y} \le 1 \\ 0 & \text{otherwise} \end{cases} = \begin{cases} e^{-y} & y \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$

By comparison with Example 2.4.4, we see that $Y \sim$ Exponential(1), i.e., that $Y$ has the Exponential(1) distribution. ■
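The conclusion of Example 2.6.6 can be checked informally by simulation. The sketch below is not part of the text's argument: it draws uniform values with a fixed seed, applies $h(x) = \ln(1/x)$, and compares the empirical proportion of values at most $y$ with the Exponential(1) cdf $1 - e^{-y}$ (the cdf obtained by integrating the density $e^{-y}$). The sample size and seed are arbitrary choices.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 200_000

# rng.random() lies in [0, 1); using 1 - u gives a value in (0, 1], avoiding ln(1/0).
samples = [math.log(1.0 / (1.0 - rng.random())) for _ in range(n)]

def empirical_cdf(y):
    """Proportion of simulated Y values that are <= y."""
    return sum(1 for s in samples if s <= y) / n

for y in [0.5, 1.0, 2.0]:
    print(y, empirical_cdf(y), 1.0 - math.exp(-y))
```

The empirical and theoretical columns should be close but not identical, since the simulation carries sampling error.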


Finally, we note the following. 


Theorem 2.6.4 Theorem 2.6.2 (and 2.6.3) remains true assuming only that $h$ is strictly increasing (or decreasing) at places for which $f_X(x) > 0$. If $f_X(x) = 0$ for an interval of $x$ values, then it does not matter how the function $h$ behaves in that interval (or even if it is well defined there).

EXAMPLE 2.6.7
If $X \sim$ Exponential($\lambda$), then $f_X(x) = 0$ for $x < 0$. Therefore, it is required that $h$ be strictly increasing (or decreasing) only for $x \ge 0$. Thus, functions such as $h(x) = x^2$, $h(x) = x^8$, and $h(x) = \sqrt{x}$ could still be used with Theorem 2.6.2, while functions such as $h(x) = -x^2$, $h(x) = -x^8$, and $h(x) = -\sqrt{x}$ could still be used with Theorem 2.6.3, even though such functions may not necessarily be strictly increasing (or decreasing) and well defined on the entire real line. ■




Summary of Section 2.6 


• If $X$ is discrete, and $Y = h(X)$, then $P(Y = y) = \sum_{x : h(x) = y} P(X = x)$.

• If $X$ is absolutely continuous, and $Y = h(X)$ with $h$ strictly increasing or strictly decreasing, then the density of $Y$ is given by $f_Y(y) = f_X(h^{-1}(y)) / |h'(h^{-1}(y))|$.

• This allows us to compute the distribution of a function of a random variable.


EXERCISES 


2.6.1 Let $X \sim$ Uniform[$L$, $R$]. Let $Y = cX + d$, where $c > 0$. Prove that $Y \sim$ Uniform[$cL + d$, $cR + d$]. (This generalizes Example 2.6.4.)

2.6.2 Let $X \sim$ Uniform[$L$, $R$]. Let $Y = cX + d$, where $c < 0$. Prove that $Y \sim$ Uniform[$cR + d$, $cL + d$]. (In particular, if $L = 0$ and $R = 1$ and $c = -1$ and $d = 1$, then $X \sim$ Uniform[0, 1] and also $Y = 1 - X \sim$ Uniform[0, 1].)

2.6.3 Let $X \sim N(\mu, \sigma^2)$. Let $Y = cX + d$, where $c > 0$. Prove that $Y \sim N(c\mu + d, c^2\sigma^2)$. (This generalizes Example 2.6.5.)

2.6.4 Let $X \sim$ Exponential($\lambda$). Let $Y = cX$, where $c > 0$. Prove that $Y \sim$ Exponential($\lambda/c$).

2.6.5 Let X ~ Exponential(A). Let Y = X?. Compute the density fy of Y. 

2.6.6 Let $X \sim$ Exponential($\lambda$). Let $Y = X^{1/4}$. Compute the density $f_Y$ of $Y$. (Hint: Use Theorem 2.6.4.)

2.6.7 Let $X \sim$ Uniform[0, 3]. Let $Y = X^2$. Compute the density function $f_Y$ of $Y$.

2.6.8 Let $X$ have a density such that $f_X(\mu + x) = f_X(\mu - x)$, i.e., it is symmetric about $\mu$. Let $Y = 2\mu - X$. Show that the density of $Y$ is given by $f_X$. Use this to determine the distribution of $Y$ when $X \sim N(\mu, \sigma^2)$.

2.6.9 Let $X$ have density function $f_X(x) = x^3/4$ for $0 < x < 2$, otherwise $f_X(x) = 0$.
(a) Let Y = X?. Compute the density function fy (y) for Y. 

(b) Let $Z = \sqrt{X}$. Compute the density function $f_Z(z)$ for $Z$.

2.6.10 Let $X \sim$ Uniform[0, $\pi/2$]. Let $Y = \sin(X)$. Compute the density function $f_Y(y)$ for $Y$.

2.6.11 Let $X$ have density function $f_X(x) = (1/2)\sin(x)$ for $0 < x < \pi$, otherwise
fx(x) = 0. Let Y = X*. Compute the density function fy (y) for Y. 

2.6.12 Let $X$ have density function $f_X(x) = 1/x^2$ for $x > 1$, otherwise $f_X(x) = 0$. Let $Y = X^{1/3}$. Compute the density function $f_Y(y)$ for $Y$.

2.6.13 Let $X \sim$ Normal(0, 1). Let $Y = X^3$. Compute the density function $f_Y(y)$ for $Y$.


PROBLEMS 


2.6.14 Let X ~ Uniform[2, 7], Y = X, and Z = VY. Compute the density fz of Z, 
in two ways. 

(a) Apply Theorem 2.6.2 to first obtain the density of Y, then apply Theorem 2.6.2 
again to obtain the density of Z. 

(b) Observe that Z = VY = VX? = X?/2, and apply Theorem 2.6.2 just once. 




2.6.15 Let X ~ Uniform[L, R], and let Y = A(X) where h(x) = (x — ¢)f. According 
to Theorem 2.6.4, under what conditions on L, R, and c can we apply Theorem 2.6.2 
or Theorem 2.6.3 to this choice of X and Y? 
2.6.16 Let $X \sim N(\mu, \sigma^2)$. Let $Y = cX + d$, where $c < 0$. Prove that again $Y \sim N(c\mu + d, c^2\sigma^2)$, just like in Exercise 2.6.3.

2.6.17 (Log-normal($\tau$) distribution) Suppose that $X \sim N(0, \tau^2)$. Prove that $Y = e^X$ has density

$$f_\tau(y) = \frac{1}{\sqrt{2\pi}\,\tau} \exp\left( -\frac{(\ln y)^2}{2\tau^2} \right) \frac{1}{y}$$

for $y > 0$ and where $\tau > 0$ is unknown. We say that $Y \sim$ Log-normal($\tau$).

2.6.18 Suppose that X ~ Weibull (a) (see Problem 2.4.19). Determine the distribution 
of Y = Oe, 

2.6.19 Suppose that X ~ Pareto(a) (see Problem 2.4.20). Determine the distribution 
of Y= (1+ Xf -1. 

2.6.20 Suppose that X has the extreme value distribution (see Problem 2.4.23). Deter- 
mine the distribution of Y = e~*. 


CHALLENGES 


2.6.21 Theorems 2.6.2 and 2.6.3 require that $h$ be an increasing or decreasing function, at least at places where the density of $X$ is positive (see Theorem 2.6.4). Suppose now that $X \sim N(0, 1)$ and $Y = h(X)$, where $h(x) = x^2$. Then $f_X(x) > 0$ for all $x$, while $h$ is increasing only for $x > 0$ and decreasing only for $x < 0$. Hence, Theorems 2.6.2 and 2.6.3 do not directly apply. Compute $f_Y(y)$ anyway. (Hint: $P(a \le Y \le b) = P(a \le Y \le b, X \ge 0) + P(a \le Y \le b, X < 0)$.)


2.7 | Joint Distributions 


Suppose X and Y are two random variables. Even if we know the distributions of X
and Y exactly, this still does not tell us anything about the relationship between X and
Y.


EXAMPLE 2.7.1 

Let $X \sim \text{Bernoulli}(1/2)$, so that $P(X = 0) = P(X = 1) = 1/2$. Let $Y_1 = X$, and let
$Y_2 = 1 - X$. Then we clearly have $Y_1 \sim \text{Bernoulli}(1/2)$ and $Y_2 \sim \text{Bernoulli}(1/2)$ as
well.

On the other hand, the relationship between $X$ and $Y_1$ is very different from the
relationship between $X$ and $Y_2$. For example, if we know that $X = 1$, then we also must
have $Y_1 = 1$, but $Y_2 = 0$. Hence, merely knowing that $X$, $Y_1$, and $Y_2$ all have the
distribution Bernoulli(1/2) does not give us complete information about the relationships
among these random variables. ■


A formal definition of joint distribution is as follows. 




Definition 2.7.1 If X and Y are random variables, then the joint distribution of X
and Y is the collection of probabilities $P((X, Y) \in B)$, for all subsets $B \subseteq R^2$ of
pairs of real numbers.


Joint distributions, like other distributions, are so complicated that we use vari- 
ous functions to describe them, including joint cumulative distribution functions, joint 
probability functions, and joint density functions, as we now discuss. 


2.7.1 | Joint Cumulative Distribution Functions 


Definition 2.7.2 Let X and Y be random variables. Then their joint cumulative
distribution function is the function $F_{X,Y} : R^2 \to [0, 1]$ defined by

$$F_{X,Y}(x, y) = P(X \le x,\ Y \le y).$$

(Recall that the comma means "and" here, so that $F_{X,Y}(x, y)$ is the probability that
$X \le x$ and $Y \le y$.)


EXAMPLE 2.7.2 (Example 2.7.1 continued) 
Again, let $X \sim \text{Bernoulli}(1/2)$, $Y_1 = X$, and $Y_2 = 1 - X$. Then we compute that

$$F_{X,Y_1}(x, y) = P(X \le x,\ Y_1 \le y) = \begin{cases} 0 & \min(x, y) < 0 \\ 1/2 & 0 \le \min(x, y) < 1 \\ 1 & \min(x, y) \ge 1. \end{cases}$$

On the other hand,

$$F_{X,Y_2}(x, y) = P(X \le x,\ Y_2 \le y) = \begin{cases} 0 & \min(x, y) < 0 \text{ or } \max(x, y) < 1 \\ 1/2 & 0 \le \min(x, y) < 1 \le \max(x, y) \\ 1 & \min(x, y) \ge 1. \end{cases}$$

We thus see that $F_{X,Y_1}$ is quite a different function from $F_{X,Y_2}$. This reflects the
fact that, even though $Y_1$ and $Y_2$ each have the same distribution, their relationship
with $X$ is quite different. On the other hand, the functions $F_{X,Y_1}$ and $F_{X,Y_2}$ are rather
cumbersome and awkward to work with. ■
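
The two joint cdfs of this example can be checked by enumerating the two equally likely outcomes of X. The following is only a minimal Python sketch; the helper names `joint_cdf`, `cdf_y1`, and `cdf_y2` are ours and do not appear in the text.

```python
from fractions import Fraction

# X ~ Bernoulli(1/2): the outcomes x = 0 and x = 1 each have probability 1/2.
outcomes = [(0, Fraction(1, 2)), (1, Fraction(1, 2))]

def joint_cdf(transform, x, y):
    """Return P(X <= x, transform(X) <= y) by summing over the outcomes of X."""
    return sum(p for (v, p) in outcomes if v <= x and transform(v) <= y)

def cdf_y1(x, y):
    """Joint cdf of (X, Y1) with Y1 = X."""
    return joint_cdf(lambda v: v, x, y)

def cdf_y2(x, y):
    """Joint cdf of (X, Y2) with Y2 = 1 - X."""
    return joint_cdf(lambda v: 1 - v, x, y)

# At (0.5, 0.5) the outcome X = 0 gives Y1 = 0 but Y2 = 1, so the cdfs differ.
print(cdf_y1(0.5, 0.5), cdf_y2(0.5, 0.5))
```

Both pairs have identical marginals, yet the two functions disagree at points such as (0.5, 0.5), which is exactly the distinction the example draws.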


We see from this example that joint cumulative distribution functions (or joint cdfs) 
do indeed keep track of the relationship between X and Y. Indeed, joint cdfs tell us 
everything about the joint probabilities of X and Y, as the following theorem (an analog 
of Theorem 2.5.1) shows. 


Theorem 2.7.1 Let X and Y be any random variables, with joint cumulative
distribution function $F_{X,Y}$. Let $B$ be a subset of $R^2$. Then $P((X, Y) \in B)$ can be
determined solely from the values of $F_{X,Y}(x, y)$.


We shall not give a proof of Theorem 2.7.1, although it is similar to the proof of 
Theorem 2.5.1. However, the following theorem indicates why Theorem 2.7.1 is true, 
and it also provides a useful computational fact. 




Theorem 2.7.2 Let X and Y be any random variables, with joint cumulative
distribution function $F_{X,Y}$. Suppose $a \le b$ and $c \le d$. Then

$$P(a < X \le b,\ c < Y \le d) = F_{X,Y}(b, d) - F_{X,Y}(a, d) - F_{X,Y}(b, c) + F_{X,Y}(a, c).$$


PROOF According to (1.3.3),

$$P(a < X \le b,\ c < Y \le d) = P(X \le b,\ Y \le d) - P(X \le b,\ Y \le d, \text{ and either } X \le a \text{ or } Y \le c).$$

But by the principle of inclusion-exclusion (1.3.4),

$$P(X \le b,\ Y \le d, \text{ and either } X \le a \text{ or } Y \le c) = P(X \le b,\ Y \le c) + P(X \le a,\ Y \le d) - P(X \le a,\ Y \le c).$$

Combining these two equations, we see that

$$P(a < X \le b,\ c < Y \le d) = P(X \le b,\ Y \le d) - P(X \le a,\ Y \le d) - P(X \le b,\ Y \le c) + P(X \le a,\ Y \le c)$$

and from this we obtain

$$P(a < X \le b,\ c < Y \le d) = F_{X,Y}(b, d) - F_{X,Y}(a, d) - F_{X,Y}(b, c) + F_{X,Y}(a, c),$$

as claimed. ■


Joint cdfs are not easy to work with. Thus, in this section we shall also consider 
other functions, which are more convenient for pairs of discrete or absolutely continu- 
ous random variables. 


2.7.2 | Marginal Distributions 


We have seen how a joint cumulative distribution function $F_{X,Y}$ tells us about the
relationship between X and Y. However, the function $F_{X,Y}$ also tells us everything about
each of X and Y separately, as the following theorem shows.


Theorem 2.7.3 Let X and Y be two random variables, with joint cumulative
distribution function $F_{X,Y}$. Then the cumulative distribution function $F_X$ of X satisfies

$$F_X(x) = \lim_{y \to \infty} F_{X,Y}(x, y),$$

for all $x \in R^1$. Similarly, the cumulative distribution function $F_Y$ of Y satisfies

$$F_Y(y) = \lim_{x \to \infty} F_{X,Y}(x, y),$$

for all $y \in R^1$.




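PROOF Note that we always have $Y \le \infty$. Hence, using continuity of P, we have

$$F_X(x) = P(X \le x) = P(X \le x,\ Y \le \infty) = \lim_{y \to \infty} P(X \le x,\ Y \le y) = \lim_{y \to \infty} F_{X,Y}(x, y),$$

as claimed. Similarly,

$$F_Y(y) = P(Y \le y) = P(X \le \infty,\ Y \le y) = \lim_{x \to \infty} P(X \le x,\ Y \le y) = \lim_{x \to \infty} F_{X,Y}(x, y),$$

completing the proof. ■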


In the context of Theorem 2.7.3, $F_X$ is called the marginal cumulative distribution
function of X, and the distribution of X is called the marginal distribution of X.
(Similarly, $F_Y$ is called the marginal cumulative distribution function of Y, and the
distribution of Y is called the marginal distribution of Y.) Intuitively, if we think of
$F_{X,Y}$ as being a function of a pair $(x, y)$, then $F_X$ and $F_Y$ are functions of $x$ and $y$,
respectively, which could be written into the "margins" of a graph of $F_{X,Y}$.


EXAMPLE 2.7.3 
In Figure 2.7.1, we have plotted the joint distribution function

$$F_{X,Y}(x, y) = \begin{cases} 0 & x < 0 \text{ or } y < 0 \\ xy^2 & 0 \le x \le 1,\ 0 \le y \le 1 \\ x & 0 \le x \le 1,\ y > 1 \\ y^2 & x > 1,\ 0 \le y \le 1 \\ 1 & x > 1 \text{ and } y > 1. \end{cases}$$

It is easy to see that

$$F_X(x) = F_{X,Y}(x, 1) = x$$

for $0 \le x \le 1$ and that

$$F_Y(y) = F_{X,Y}(1, y) = y^2$$

for $0 \le y \le 1$. The graphs of these functions are given by the outermost edges of the
surface depicted in Figure 2.7.1.




Figure 2.7.1: Graph of the joint distribution function $F_{X,Y}(x, y) = xy^2$ for $0 \le x \le 1$ and
$0 \le y \le 1$ in Example 2.7.3.
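
As a rough numerical companion to Example 2.7.3, the sketch below codes the joint cdf of that example and recovers the marginal cdfs (Theorem 2.7.3) and rectangle probabilities (Theorem 2.7.2) from it. The function names and the cutoff `BIG` used to stand in for the limit are our own illustrative choices.

```python
def joint_cdf(x, y):
    """Joint cdf of Example 2.7.3: x * y**2 on the unit square, extended outside it."""
    if x < 0 or y < 0:
        return 0.0
    xc = min(x, 1.0)  # the cdf stops growing in x once x exceeds 1
    yc = min(y, 1.0)  # the cdf stops growing in y once y exceeds 1
    return xc * yc ** 2

BIG = 10.0  # any value >= 1 already attains the limit for this particular cdf

def marginal_cdf_x(x):
    """F_X(x) as the limit of F_{X,Y}(x, y) for large y (Theorem 2.7.3)."""
    return joint_cdf(x, BIG)

def marginal_cdf_y(y):
    """F_Y(y) as the limit of F_{X,Y}(x, y) for large x (Theorem 2.7.3)."""
    return joint_cdf(BIG, y)

def rectangle_prob(a, b, c, d):
    """P(a < X <= b, c < Y <= d) via the four-term formula of Theorem 2.7.2."""
    return (joint_cdf(b, d) - joint_cdf(a, d)
            - joint_cdf(b, c) + joint_cdf(a, c))

print(marginal_cdf_x(0.3), marginal_cdf_y(0.5), rectangle_prob(0.2, 0.5, 0.5, 1.0))
```

For this cdf the rectangle probability also equals (0.5 - 0.2)(1 - 0.25), which gives a quick hand check of the four-term formula.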


Theorem 2.7.3 thus tells us that the joint cdf $F_{X,Y}$ is very useful indeed. Not only
does it tell us about the relationship of X to Y, but it also contains all the information
about the marginal distributions of X and of Y.

We will see in the next subsections that joint probability functions, and joint density 
functions, similarly contain information about both the relationship of X and Y and the 
marginal distributions of X and Y. 


2.7.3 | Joint Probability Functions 


Suppose X and Y are both discrete random variables. Then we can define a joint 
probability function for X and Y, as follows. 


Definition 2.7.3 Let X and Y be discrete random variables. Then their joint
probability function, $p_{X,Y}$, is a function from $R^2$ to $R^1$, defined by

$$p_{X,Y}(x, y) = P(X = x,\ Y = y).$$


Consider the following example. 




EXAMPLE 2.7.4 (Examples 2.7.1 and 2.7.2 continued) 
Again, let $X \sim \text{Bernoulli}(1/2)$, $Y_1 = X$, and $Y_2 = 1 - X$. Then we see that

$$p_{X,Y_1}(x, y) = P(X = x,\ Y_1 = y) = \begin{cases} 1/2 & x = y = 1 \\ 1/2 & x = y = 0 \\ 0 & \text{otherwise.} \end{cases}$$

On the other hand,

$$p_{X,Y_2}(x, y) = P(X = x,\ Y_2 = y) = \begin{cases} 1/2 & x = 1,\ y = 0 \\ 1/2 & x = 0,\ y = 1 \\ 0 & \text{otherwise.} \end{cases}$$

We thus see that $p_{X,Y_1}$ and $p_{X,Y_2}$ are two simple functions that are easy to work
with and that clearly describe the relationships between $X$ and $Y_1$ and between $X$ and
$Y_2$. Hence, for pairs of discrete random variables, joint probability functions are usually
the best way to describe their relationships. ■


Once we know the joint probability function $p_{X,Y}$, the marginal probability
functions of X and Y are easily obtained.


Theorem 2.7.4 Let X and Y be two discrete random variables, with joint
probability function $p_{X,Y}$. Then the probability function $p_X$ of X can be computed as

$$p_X(x) = \sum_y p_{X,Y}(x, y).$$

Similarly, the probability function $p_Y$ of Y can be computed as

$$p_Y(y) = \sum_x p_{X,Y}(x, y).$$


PROOF Using additivity of P, we have that

$$p_X(x) = P(X = x) = \sum_y P(X = x,\ Y = y) = \sum_y p_{X,Y}(x, y),$$

as claimed. Similarly,

$$p_Y(y) = P(Y = y) = \sum_x P(X = x,\ Y = y) = \sum_x p_{X,Y}(x, y).$$ ■


EXAMPLE 2.7.5 


Suppose the joint probability function of X and Y is given by

$$p_{X,Y}(x, y) = \begin{cases} 1/7 & x = 5,\ y = 0 \\ 1/7 & x = 5,\ y = 3 \\ 1/7 & x = 5,\ y = 4 \\ 3/7 & x = 8,\ y = 0 \\ 1/7 & x = 8,\ y = 4 \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$p_X(5) = \sum_y p_{X,Y}(5, y) = p_{X,Y}(5, 0) + p_{X,Y}(5, 3) + p_{X,Y}(5, 4) = \frac{1}{7} + \frac{1}{7} + \frac{1}{7} = \frac{3}{7},$$

while

$$p_X(8) = \sum_y p_{X,Y}(8, y) = p_{X,Y}(8, 0) + p_{X,Y}(8, 4) = \frac{3}{7} + \frac{1}{7} = \frac{4}{7}.$$

Similarly,

$$p_Y(4) = \sum_x p_{X,Y}(x, 4) = p_{X,Y}(5, 4) + p_{X,Y}(8, 4) = \frac{1}{7} + \frac{1}{7} = \frac{2}{7},$$

etc.

Note that in such a simple context it is possible to tabulate the joint probability
function in a table, as illustrated below for $p_{X,Y}$, $p_X$, and $p_Y$ of this example.

           Y = 0    Y = 3    Y = 4
  X = 5     1/7      1/7      1/7   |  3/7
  X = 8     3/7       0       1/7   |  4/7
            4/7      1/7      2/7   |

Summing the rows and columns and placing the totals in the margins gives the marginal
distributions of X and Y. ■
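
The row and column sums of Example 2.7.5 are easy to reproduce mechanically. Below is a small Python sketch of Theorem 2.7.4, using exact fractions; the dictionary layout and the function name `marginals` are our own choices for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Joint probability function of Example 2.7.5, keyed by (x, y).
# Pairs not listed have probability 0.
joint = {
    (5, 0): Fraction(1, 7),
    (5, 3): Fraction(1, 7),
    (5, 4): Fraction(1, 7),
    (8, 0): Fraction(3, 7),
    (8, 4): Fraction(1, 7),
}

def marginals(joint_pmf):
    """Return (p_X, p_Y), summing the joint pmf over the other variable."""
    p_x, p_y = defaultdict(Fraction), defaultdict(Fraction)
    for (x, y), p in joint_pmf.items():
        p_x[x] += p  # sum over y for fixed x
        p_y[y] += p  # sum over x for fixed y
    return dict(p_x), dict(p_y)

p_x, p_y = marginals(joint)
print(p_x)
print(p_y)
```

The resulting dictionaries correspond to the right-hand and bottom margins of the table in the example.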


2.7.4 | Joint Density Functions 


If X and Y are continuous random variables, then clearly $p_{X,Y}(x, y) = 0$ for all x and
y. Hence, joint probability functions are not useful in this case. On the other hand, we
shall see here that if X and Y are jointly absolutely continuous, then their relationship 
may be usefully described by a joint density function. 


Definition 2.7.4 Let $f : R^2 \to R^1$ be a function. Then $f$ is a joint density function
if $f(x, y) \ge 0$ for all $x$ and $y$, and $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$.


Definition 2.7.5 Let X and Y be random variables. Then X and Y are jointly
absolutely continuous if there is a joint density function $f$, such that

$$P(a \le X \le b,\ c \le Y \le d) = \int_c^d \int_a^b f(x, y)\,dx\,dy,$$

for all $a \le b$, $c \le d$.




Consider the following example. 
EXAMPLE 2.7.6 


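Let X and Y be jointly absolutely continuous, with joint density function $f$ given by

$$f(x, y) = \begin{cases} 4x^2 y + 2y^5 & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

We first verify that $f$ is indeed a density function. Clearly, $f(x, y) \ge 0$ for all $x$
and $y$. Also,

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = \int_0^1 \int_0^1 (4x^2 y + 2y^5)\,dx\,dy = \int_0^1 \left(\frac{4}{3} y + 2y^5\right) dy = \frac{4}{3} \cdot \frac{1}{2} + 2 \cdot \frac{1}{6} = \frac{2}{3} + \frac{1}{3} = 1.$$

Hence, $f$ is a joint density function. In Figure 2.7.2, we have plotted the function $f$,
which gives a surface over the unit square.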




Figure 2.7.2: A plot of the density f in Example 2.7.6. 




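We next compute $P(0.5 \le X \le 0.7,\ 0.2 \le Y \le 0.9)$. Indeed, we have

$$P(0.5 \le X \le 0.7,\ 0.2 \le Y \le 0.9) = \int_{0.2}^{0.9} \int_{0.5}^{0.7} (4x^2 y + 2y^5)\,dx\,dy$$

$$= \int_{0.2}^{0.9} \left[ \frac{4}{3} \left((0.7)^3 - (0.5)^3\right) y + 2y^5 (0.7 - 0.5) \right] dy$$

$$= \frac{4}{3} \left((0.7)^3 - (0.5)^3\right) \frac{1}{2} \left((0.9)^2 - (0.2)^2\right) + \frac{2}{6} \left((0.9)^6 - (0.2)^6\right)(0.7 - 0.5)$$

$$= \frac{2}{3} \left((0.7)^3 - (0.5)^3\right)\left((0.9)^2 - (0.2)^2\right) + \frac{1}{3} \left((0.9)^6 - (0.2)^6\right)(0.7 - 0.5) = 0.147.$$

Other probabilities can be computed similarly. ■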
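
The two integrals of Example 2.7.6 can also be approximated numerically. The sketch below uses a plain midpoint rule, which is our choice of method for illustration rather than anything prescribed by the text; the grid size `n` is likewise an arbitrary assumption.

```python
def f(x, y):
    """Joint density of Example 2.7.6."""
    if 0 <= x <= 1 and 0 <= y <= 1:
        return 4 * x**2 * y + 2 * y**5
    return 0.0

def midpoint_double_integral(g, a, b, c, d, n=400):
    """Approximate the integral of g over [a, b] x [c, d] with an n-by-n midpoint rule."""
    hx, hy = (b - a) / n, (d - c) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * hx      # midpoint of the i-th x-cell
        for j in range(n):
            y = c + (j + 0.5) * hy  # midpoint of the j-th y-cell
            total += g(x, y)
    return total * hx * hy

total_mass = midpoint_double_integral(f, 0, 1, 0, 1)
prob = midpoint_double_integral(f, 0.5, 0.7, 0.2, 0.9)
print(total_mass, prob)
```

The first value should be close to 1 (so that f is a density) and the second close to the 0.147 obtained analytically in the example.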


Once we know a joint density $f_{X,Y}$, then computing the marginal densities of X
and Y is very easy, as the following theorem shows.


Theorem 2.7.5 Let X and Y be jointly absolutely continuous random variables,
with joint density function $f_{X,Y}$. Then the (marginal) density $f_X$ of X satisfies

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy,$$

for all $x \in R^1$. Similarly, the (marginal) density $f_Y$ of Y satisfies

$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx,$$

for all $y \in R^1$.


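PROOF We need to show that, for $a \le b$,

$$P(a \le X \le b) = \int_a^b f_X(x)\,dx = \int_a^b \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy\,dx.$$

Now, we always have $-\infty < Y < \infty$. Hence, using continuity of P, we have that
$P(a \le X \le b) = P(a \le X \le b,\ -\infty < Y < \infty)$ and

$$P(a \le X \le b,\ -\infty < Y < \infty) = \lim_{c \to -\infty,\ d \to \infty} P(a \le X \le b,\ c \le Y \le d) = \lim_{c \to -\infty,\ d \to \infty} \int_c^d \int_a^b f_{X,Y}(x, y)\,dx\,dy$$

$$= \lim_{c \to -\infty,\ d \to \infty} \int_a^b \int_c^d f_{X,Y}(x, y)\,dy\,dx = \int_a^b \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy\,dx,$$

as claimed. The result for $f_Y$ follows similarly. ■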


EXAMPLE 2.7.7 (Example 2.7.6 continued) 
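Let X and Y again have joint density

$$f_{X,Y}(x, y) = \begin{cases} 4x^2 y + 2y^5 & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Then by Theorem 2.7.5, for $0 \le x \le 1$,

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \int_0^1 (4x^2 y + 2y^5)\,dy = 2x^2 + (1/3),$$

while for $x < 0$ or $x > 1$,

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \int_{-\infty}^{\infty} 0\,dy = 0.$$

Similarly, for $0 \le y \le 1$,

$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx = \int_0^1 (4x^2 y + 2y^5)\,dx = \frac{4}{3} y + 2y^5,$$

while for $y < 0$ or $y > 1$, $f_Y(y) = 0$. ■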


EXAMPLE 2.7.8 
Suppose X and Y are jointly absolutely continuous, with joint density

$$f_{X,Y}(x, y) = \begin{cases} 120 x^3 y & x \ge 0,\ y \ge 0,\ x + y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Then the region where $f_{X,Y}(x, y) > 0$ is a triangle, as depicted in Figure 2.7.3.

Figure 2.7.3: Region of the plane where the density $f_{X,Y}$ in Example 2.7.8 is positive.

We check that

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy\,dx = \int_0^1 \int_0^{1-x} 120 x^3 y\,dy\,dx = \int_0^1 120 x^3 \frac{(1 - x)^2}{2}\,dx$$

$$= \int_0^1 60 (x^3 - 2x^4 + x^5)\,dx = 60 \left(\frac{1}{4} - 2 \cdot \frac{1}{5} + \frac{1}{6}\right) = 15 - 2(12) + 10 = 1,$$

so that $f_{X,Y}$ is indeed a joint density function. We then compute that, for example,

$$f_X(x) = \int_0^{1-x} 120 x^3 y\,dy = 120 x^3 \frac{(1 - x)^2}{2} = 60 (x^3 - 2x^4 + x^5)$$

for $0 \le x \le 1$ (with $f_X(x) = 0$ for $x < 0$ or $x > 1$). ■
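
The marginal density of Example 2.7.8 can be spot-checked by integrating out y numerically. This is only a sketch: the midpoint rule, the grid size, and the function names are our own assumptions, and the closed form is the one derived in the example.

```python
def joint_density(x, y):
    """Joint density of Example 2.7.8, positive only on the triangle x, y >= 0, x + y <= 1."""
    if x >= 0 and y >= 0 and x + y <= 1:
        return 120 * x**3 * y
    return 0.0

def marginal_x_numeric(x, n=2000):
    """Approximate f_X(x) by a midpoint rule in y over [0, 1] (Theorem 2.7.5)."""
    h = 1.0 / n
    return sum(joint_density(x, (j + 0.5) * h) for j in range(n)) * h

def marginal_x_formula(x):
    """Closed form 60 (x^3 - 2 x^4 + x^5), valid for 0 <= x <= 1."""
    return 60 * (x**3 - 2 * x**4 + x**5)

for x in (0.3, 0.5):
    print(x, marginal_x_numeric(x), marginal_x_formula(x))
```

The indicator inside `joint_density` handles the triangular support, so the inner integral automatically stops at y = 1 - x.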


EXAMPLE 2.7.9 Bivariate Normal($\mu_1, \mu_2, \sigma_1, \sigma_2, \rho$) Distribution
Let $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$, and $\rho$ be real numbers, with $\sigma_1, \sigma_2 > 0$ and $-1 < \rho < 1$. Let X
and Y have joint density given by

$$f_{X,Y}(x, y) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left(\frac{x - \mu_1}{\sigma_1}\right)^2 + \left(\frac{y - \mu_2}{\sigma_2}\right)^2 - 2\rho \left(\frac{x - \mu_1}{\sigma_1}\right) \left(\frac{y - \mu_2}{\sigma_2}\right) \right] \right\}$$

for $x \in R^1$, $y \in R^1$. We say that X and Y have the Bivariate Normal($\mu_1, \mu_2, \sigma_1, \sigma_2, \rho$)
distribution.

It can be shown (see Problem 2.7.13) that $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$.
Hence, X and Y are each normally distributed. The parameter $\rho$ measures the degree
of the relationship that exists between X and Y (see Problem 3.3.17) and is called
the correlation. In particular, X and Y are independent (see Section 2.8.3), and so
unrelated, if and only if $\rho = 0$ (see Problem 2.8.21).

Figure 2.7.4 is a plot of the standard bivariate normal density, given by setting
$\mu_1 = 0$, $\mu_2 = 0$, $\sigma_1 = 1$, $\sigma_2 = 1$, and $\rho = 0$. This is a bell-shaped surface in $R^3$
with its peak at the point (0, 0) in the xy-plane. The graph of the general Bivariate
Normal($\mu_1, \mu_2, \sigma_1, \sigma_2, \rho$) distribution is also a bell-shaped surface, but the peak is at
the point $(\mu_1, \mu_2)$ in the xy-plane and the shape of the bell is controlled by $\sigma_1$, $\sigma_2$, and
$\rho$.


Figure 2.7.4: A plot of the standard bivariate normal density function. 




It can be shown (see Problem 2.9.16) that, when $Z_1$, $Z_2$ are independent random
variables, both distributed $N(0, 1)$, and we put

$$X = \mu_1 + \sigma_1 Z_1, \qquad Y = \mu_2 + \sigma_2 \left(\rho Z_1 + (1 - \rho^2)^{1/2} Z_2\right), \qquad (2.7.1)$$

then $(X, Y) \sim$ Bivariate Normal($\mu_1, \mu_2, \sigma_1, \sigma_2, \rho$). This relationship can be quite
useful in establishing various properties of this distribution. We can also write an
analogous version $Y = \mu_2 + \sigma_2 Z_1$, $X = \mu_1 + \sigma_1 (\rho Z_1 + (1 - \rho^2)^{1/2} Z_2)$ and obtain
the same distributional result.

The bivariate normal distribution is one of the most commonly used bivariate dis- 
tributions in applications. For example, if we randomly select an individual from a 
population and measure his weight X and height Y, then a bivariate normal distribution 
will often provide a reasonable description of the joint distribution of these variables. ■
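
The construction (2.7.1) also gives a simple way to simulate bivariate normal pairs from two independent standard normals. The following Python sketch is illustrative only: the function names, the seed, the sample size, and the parameter values (3, 5, 2, 4, 1/2) are our own choices, and the sample correlation is used merely as an informal check that the simulated pairs behave as the parameter ρ suggests.

```python
import math
import random

def sample_bivariate_normal(mu1, mu2, sigma1, sigma2, rho, rng):
    """Draw one (X, Y) pair using the construction (2.7.1)."""
    z1 = rng.gauss(0.0, 1.0)  # Z1 ~ N(0, 1)
    z2 = rng.gauss(0.0, 1.0)  # Z2 ~ N(0, 1), drawn independently of Z1
    x = mu1 + sigma1 * z1
    y = mu2 + sigma2 * (rho * z1 + math.sqrt(1.0 - rho**2) * z2)
    return x, y

def sample_correlation(pairs):
    """Sample correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is reproducible
pairs = [sample_bivariate_normal(3, 5, 2, 4, 0.5, rng) for _ in range(50_000)]
r = sample_correlation(pairs)
mean_x = sum(x for x, _ in pairs) / len(pairs)
print(r, mean_x)
```

With this many draws, the sample correlation should land near 0.5 and the sample mean of X near 3, up to simulation noise.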


Joint densities can also be used to compute probabilities of more general regions,
as the following result shows. (We omit the proof. The special case $B = [a, b] \times [c, d]$
corresponds directly to the definition of $f_{X,Y}$.)


Theorem 2.7.6 Let X and Y be jointly absolutely continuous random variables,
with joint density $f_{X,Y}$, and let $B \subseteq R^2$ be any region. Then

$$P((X, Y) \in B) = \iint_B f_{X,Y}(x, y)\,dx\,dy.$$


The previous discussion has centered around having just two random variables,
X and Y. More generally, we may consider n random variables $X_1, \ldots, X_n$. If the
random variables are all discrete, then we can further define a joint probability function
$p_{X_1,\ldots,X_n} : R^n \to [0, 1]$ by $p_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = P(X_1 = x_1, \ldots, X_n = x_n)$.

If the random variables are jointly absolutely continuous, then we can define a joint
density function $f_{X_1,\ldots,X_n} : R^n \to [0, \infty)$ so that

$$P(a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n) = \int_{a_n}^{b_n} \cdots \int_{a_1}^{b_1} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_1 \cdots dx_n,$$

whenever $a_i \le b_i$ for all $i$.


Summary of Section 2.7 


• It is often important to keep track of the joint probabilities of two random
variables, X and Y.

• Their joint cumulative distribution function is given by $F_{X,Y}(x, y) = P(X \le x,\ Y \le y)$.

• If X and Y are discrete, then their joint probability function is given by
$p_{X,Y}(x, y) = P(X = x,\ Y = y)$.

• If X and Y are absolutely continuous, then their joint density function $f_{X,Y}(x, y)$
is such that $P(a \le X \le b,\ c \le Y \le d) = \int_c^d \int_a^b f_{X,Y}(x, y)\,dx\,dy$.

• The marginal distributions of X and Y can be computed from any of $F_{X,Y}$, $p_{X,Y}$,
or $f_{X,Y}$.

• An important example of a joint distribution is the bivariate normal distribution.


EXERCISES 


2.7.1 Let $X \sim \text{Bernoulli}(1/3)$, and let $Y = 4X - 2$. Compute the joint cdf $F_{X,Y}$.
2.7.2 Let $X \sim \text{Bernoulli}(1/4)$, and let $Y = -7X$. Compute the joint cdf $F_{X,Y}$.
2.7.3 Suppose

$$p_{X,Y}(x, y) = \begin{cases} 1/5 & x = 2,\ y = 3 \\ 1/5 & x = 3,\ y = 2 \\ 1/5 & x = -3,\ y = -2 \\ 1/5 & x = -2,\ y = -3 \\ 1/5 & x = 17,\ y = 19 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Compute $p_X$.
(b) Compute $p_Y$.
(c) Compute $P(Y > X)$.
(d) Compute $P(Y = X)$.
(e) Compute $P(XY < 0)$.

2.7.4 For each of the following joint density functions $f_{X,Y}$, find the value of C and
compute $f_X(x)$, $f_Y(y)$, and $P(X \le 0.8,\ Y \le 0.6)$.


a 
ji 53,9 
fre = | ae en 
(c) < 
r 3.459. 


2.7.5 Prove that $F_{X,Y}(x, y) \le \min(F_X(x), F_Y(y))$.
2.7.6 Suppose $P(X = x,\ Y = y) = 1/8$ for $x = 3, 5$ and $y = 1, 2, 4, 7$, otherwise
$P(X = x,\ Y = y) = 0$. Compute each of the following.




(a) $F_{X,Y}(x, y)$ for all $x, y \in R^1$

(b) $p_{X,Y}(x, y)$ for all $x, y \in R^1$

(c) $p_X(x)$ for all $x \in R^1$

(d) $p_Y(y)$ for all $y \in R^1$

(e) The marginal cdf $F_X(x)$ for all $x \in R^1$

(f) The marginal cdf $F_Y(y)$ for all $y \in R^1$

2.7.7 Let X and Y have joint density $f_{X,Y}(x, y) = c \sin(xy)$ for $0 < x < 1$ and
$0 < y < 2$, otherwise $f_{X,Y}(x, y) = 0$, for appropriate constant $c > 0$ (which cannot
be computed explicitly). In terms of c, compute each of the following.

(a) The marginal density $f_X(x)$ for all $x \in R^1$

(b) The marginal density $f_Y(y)$ for all $y \in R^1$

2.7.8 Let X and Y have joint density $f_{X,Y}(x, y) = (x^2 + y)/36$ for $-2 < x < 1$ and
$0 < y < 4$, otherwise $f_{X,Y}(x, y) = 0$. Compute each of the following.

(a) The marginal density $f_X(x)$ for all $x \in R^1$

(b) The marginal density $f_Y(y)$ for all $y \in R^1$

(c) $P(Y < 1)$

(d) The joint cdf $F_{X,Y}(x, y)$ for all $x, y \in R^1$

2.7.9 Let X and Y have joint density $f_{X,Y}(x, y) = (x^2 + y)/4$ for $0 < x < y < 2$,
otherwise $f_{X,Y}(x, y) = 0$. Compute each of the following.

(a) The marginal density $f_X(x)$ for all $x \in R^1$

(b) The marginal density $f_Y(y)$ for all $y \in R^1$

(c) $P(Y < 1)$

2.7.10 Let X and Y have the Bivariate-Normal(3, 5, 2, 4, 1/2) distribution. 

(a) Specify the marginal distribution of X. 

(b) Specify the marginal distribution of Y. 

(c) Are X and Y independent? Why or why not? 


PROBLEMS 


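2.7.11 Let $X \sim \text{Exponential}(\lambda)$, and let $Y = X^3$. Compute the joint cdf, $F_{X,Y}(x, y)$.
2.7.12 Let $F_{X,Y}$ be a joint cdf. Prove that for all $y \in R^1$, $\lim_{x \to -\infty} F_{X,Y}(x, y) = 0$.
2.7.13 Let X and Y have the Bivariate Normal($\mu_1, \mu_2, \sigma_1, \sigma_2, \rho$) distribution, as in
Example 2.7.9. Prove that $X \sim N(\mu_1, \sigma_1^2)$, by proving that

$$\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \frac{1}{\sigma_1 \sqrt{2\pi}} \exp\left\{ -\frac{(x - \mu_1)^2}{2\sigma_1^2} \right\}.$$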

2.7.14 Suppose that the joint density $f_{X,Y}$ is given by $f_{X,Y}(x, y) = C y e^{-xy}$ for
$0 < x < 1$, $0 < y < 1$ and is 0 otherwise.

(a) Determine C so that $f_{X,Y}$ is a density.
(b) Compute $P(1/2 < X < 1,\ 1/2 < Y < 1)$.
(c) Compute the marginal densities of X and Y.

2.7.15 Suppose that the joint density $f_{X,Y}$ is given by $f_{X,Y}(x, y) = C y e^{-xy}$ for
$0 < x < y < 1$ and is 0 otherwise.

(a) Determine C so that $f_{X,Y}$ is a density.

(b) Compute $P(1/2 < X < 1,\ 1/2 < Y < 1)$.

(c) Compute the marginal densities of X and Y.

2.7.16 Suppose that the joint density $f_{X,Y}$ is given by $f_{X,Y}(x, y) = C e^{-(x+y)}$ for
$0 < x < y < \infty$ and is 0 otherwise.

(a) Determine C so that $f_{X,Y}$ is a density.

(b) Compute the marginal densities of X and Y.

2.7.17 (Dirichlet($\alpha_1, \alpha_2, \alpha_3$) distribution) Let $(X_1, X_2)$ have the joint density

$$f_{X_1,X_2}(x_1, x_2) = \frac{\Gamma(\alpha_1 + \alpha_2 + \alpha_3)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\,\Gamma(\alpha_3)}\, x_1^{\alpha_1 - 1} x_2^{\alpha_2 - 1} (1 - x_1 - x_2)^{\alpha_3 - 1}$$

for $x_1 \ge 0$, $x_2 \ge 0$, and $0 \le x_1 + x_2 \le 1$. A Dirichlet distribution is often applicable
when $X_1$, $X_2$, and $1 - X_1 - X_2$ correspond to random proportions.

(a) Prove that $f_{X_1,X_2}$ is a density. (Hint: Sketch the region where $f_{X_1,X_2}$ is nonnegative,
integrate out $x_1$ first by making the transformation $u = x_1/(1 - x_2)$ in this integral, and
use (2.4.10) from Problem 2.4.24.)

(b) Prove that $X_1 \sim \text{Beta}(\alpha_1, \alpha_2 + \alpha_3)$ and $X_2 \sim \text{Beta}(\alpha_2, \alpha_1 + \alpha_3)$.
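2.7.18 (Dirichlet($\alpha_1, \ldots, \alpha_{k+1}$) distribution) Let $(X_1, \ldots, X_k)$ have the joint density

$$f_{X_1,\ldots,X_k}(x_1, \ldots, x_k) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_{k+1})}\, x_1^{\alpha_1 - 1} \cdots x_k^{\alpha_k - 1} (1 - x_1 - \cdots - x_k)^{\alpha_{k+1} - 1}$$

for $x_i \ge 0$, $i = 1, \ldots, k$, and $0 \le x_1 + \cdots + x_k \le 1$. Prove that $f_{X_1,\ldots,X_k}$ is a density.
(Hint: Problem 2.7.17.)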


CHALLENGES 


2.7.19 Find an example of two random variables X and Y and a function $h : R^1 \to R^1$,
such that $F_X(x) > 0$ and $F_Y(x) > 0$ for all $x \in R^1$, but $\lim_{x \to \infty} F_{X,Y}(x, h(x)) = 0$.


DISCUSSION TOPICS 


2.7.20 What are examples of pairs of real-life random quantities that have interesting 
relationships? (List as many as you can, and describe each relationship as well as you 
can.) 




2.8 | Conditioning and Independence 


Let X and Y be two random variables. Suppose we know that X = 5. What does 
that tell us about Y? Depending on the relationship between X and Y, that may tell 
us everything about Y (e.g., if Y = X), or nothing about Y. Usually, the answer 
will be between these two extremes, and the knowledge that X = 5 will change the 
probabilities for Y somewhat. 




2.8.1 | Conditioning on Discrete Random Variables 


Suppose X is a discrete random variable, with $P(X = 5) > 0$. Let $a < b$, and suppose
we are interested in the conditional probability $P(a < Y \le b \mid X = 5)$. Well, we
already know how to compute such conditional probabilities. Indeed, by (1.5.1),

$$P(a < Y \le b \mid X = 5) = \frac{P(a < Y \le b,\ X = 5)}{P(X = 5)},$$

provided that $P(X = 5) > 0$. This prompts the following definition.


Definition 2.8.1 Let X and Y be random variables, and suppose that $P(X = x) > 0$.
The conditional distribution of Y, given that $X = x$, is the probability distribution
assigning probability

$$\frac{P(Y \in B,\ X = x)}{P(X = x)}$$

to each event $Y \in B$. In particular, it assigns probability

$$\frac{P(a < Y \le b,\ X = x)}{P(X = x)}$$

to the event that $a < Y \le b$.


EXAMPLE 2.8.1 
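Suppose as in Example 2.7.5 that X and Y have joint probability function

$$p_{X,Y}(x, y) = \begin{cases} 1/7 & x = 5,\ y = 0 \\ 1/7 & x = 5,\ y = 3 \\ 1/7 & x = 5,\ y = 4 \\ 3/7 & x = 8,\ y = 0 \\ 1/7 & x = 8,\ y = 4 \\ 0 & \text{otherwise.} \end{cases}$$

We compute $P(Y = 4 \mid X = 8)$ as

$$P(Y = 4 \mid X = 8) = \frac{P(Y = 4,\ X = 8)}{P(X = 8)} = \frac{1/7}{(3/7) + (1/7)} = \frac{1/7}{4/7} = 1/4.$$

On the other hand,

$$P(Y = 4 \mid X = 5) = \frac{P(Y = 4,\ X = 5)}{P(X = 5)} = \frac{1/7}{(1/7) + (1/7) + (1/7)} = \frac{1/7}{3/7} = 1/3.$$

Thus, depending on the value of X, we obtain different probabilities for Y. ■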
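
The conditional probabilities of Example 2.8.1 follow the same recipe for any discrete joint probability function: restrict to the pairs with the given x and renormalize by P(X = x). A short Python sketch, with a function name of our own choosing:

```python
from fractions import Fraction

# Joint probability function of Examples 2.7.5 and 2.8.1, keyed by (x, y).
joint = {
    (5, 0): Fraction(1, 7),
    (5, 3): Fraction(1, 7),
    (5, 4): Fraction(1, 7),
    (8, 0): Fraction(3, 7),
    (8, 4): Fraction(1, 7),
}

def conditional_pmf_y_given_x(joint_pmf, x):
    """Return {y: P(Y = y | X = x)}; requires P(X = x) > 0."""
    p_x = sum(p for (xv, _), p in joint_pmf.items() if xv == x)
    if p_x == 0:
        raise ValueError("P(X = x) must be positive to condition on X = x")
    return {y: p / p_x for (xv, y), p in joint_pmf.items() if xv == x}

print(conditional_pmf_y_given_x(joint, 8))
print(conditional_pmf_y_given_x(joint, 5))
```

Raising an error when P(X = x) = 0 mirrors the requirement in Definition 2.8.1 that the conditioning event have positive probability.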
Generalizing from the above example, we see that if X and Y are discrete, then

$$P(Y = y \mid X = x) = \frac{P(Y = y,\ X = x)}{P(X = x)} = \frac{p_{X,Y}(x, y)}{p_X(x)} = \frac{p_{X,Y}(x, y)}{\sum_z p_{X,Y}(x, z)}.$$

This prompts the following definition.


Definition 2.8.2 Suppose X and Y are two discrete random variables. Then the
conditional probability function of Y, given X, is the function $p_{Y|X}$ defined by

$$p_{Y|X}(y \mid x) = \frac{p_{X,Y}(x, y)}{\sum_z p_{X,Y}(x, z)} = \frac{p_{X,Y}(x, y)}{p_X(x)},$$

defined for all $y \in R^1$ and all $x$ with $p_X(x) > 0$.


2.8.2 | Conditioning on Continuous Random Variables 


If X is continuous, then we will have P(X = x) = 0. In this case, Definitions 2.8.1 
and 2.8.2 cannot be used because we cannot divide by 0. So how can we condition on 
X =x in this case? 

One approach is suggested by instead conditioning on $x - \epsilon \le X \le x + \epsilon$, where
$\epsilon > 0$ is a very small number. Even if X is continuous, we might still have
$P(x - \epsilon \le X \le x + \epsilon) > 0$. On the other hand, if $\epsilon$ is very small and
$x - \epsilon \le X \le x + \epsilon$, then X must be very close to x.

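Indeed, suppose that X and Y are jointly absolutely continuous, with joint density
function $f_{X,Y}$. Then

$$P(a \le Y \le b \mid x - \epsilon \le X \le x + \epsilon) = \frac{P(a \le Y \le b,\ x - \epsilon \le X \le x + \epsilon)}{P(x - \epsilon \le X \le x + \epsilon)} = \frac{\int_a^b \int_{x-\epsilon}^{x+\epsilon} f_{X,Y}(t, y)\,dt\,dy}{\int_{-\infty}^{\infty} \int_{x-\epsilon}^{x+\epsilon} f_{X,Y}(t, y)\,dt\,dy}.$$

In Figure 2.8.1, we have plotted the region $\{(x, y) : a \le y \le b,\ x - \epsilon \le x \le x + \epsilon\}$
for $(X, Y)$.

Figure 2.8.1: The shaded region is the set $\{(x, y) : a \le y \le b,\ x - \epsilon \le x \le x + \epsilon\}$.

Now, if $\epsilon$ is very small, then in the above integrals we will always have $t$ very close
to $x$. If $f_{X,Y}$ is a continuous function, then this implies that $f_{X,Y}(t, y)$ will be very
close to $f_{X,Y}(x, y)$. We conclude that, if $\epsilon$ is very small, then

$$P(a \le Y \le b \mid x - \epsilon \le X \le x + \epsilon) \approx \frac{\int_a^b \int_{x-\epsilon}^{x+\epsilon} f_{X,Y}(x, y)\,dt\,dy}{\int_{-\infty}^{\infty} \int_{x-\epsilon}^{x+\epsilon} f_{X,Y}(x, y)\,dt\,dy} = \frac{\int_a^b f_{X,Y}(x, y)\,dy}{\int_{-\infty}^{\infty} f_{X,Y}(x, z)\,dz}.$$

This suggests that the quantity

$$\frac{f_{X,Y}(x, y)}{\int_{-\infty}^{\infty} f_{X,Y}(x, z)\,dz} = \frac{f_{X,Y}(x, y)}{f_X(x)}$$

plays the role of a density, for the conditional distribution of Y, given that $X = x$. This
prompts the following definitions.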


Definition 2.8.3 Let X and Y be jointly absolutely continuous, with joint density
function $f_{X,Y}$. The conditional density of Y, given $X = x$, is the function
$f_{Y|X}(y \mid x)$, defined by

$$f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)},$$

valid for all $y \in R^1$, and for all $x$ such that $f_X(x) > 0$.


Definition 2.8.4 Let X and Y be jointly absolutely continuous, with joint density
function $f_{X,Y}$. The conditional distribution of Y, given $X = x$, is defined by saying
that

$$P(a \le Y \le b \mid X = x) = \int_a^b f_{Y|X}(y \mid x)\,dy,$$

when $a \le b$, with $f_{Y|X}$ as in Definition 2.8.3, valid for all $x$ such that $f_X(x) > 0$.


EXAMPLE 2.8.2 
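Let X and Y have joint density

$$f_{X,Y}(x, y) = \begin{cases} 4x^2 y + 2y^5 & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise,} \end{cases}$$

as in Examples 2.7.6 and 2.7.7.
We know from Example 2.7.7 that

$$f_X(x) = \begin{cases} 2x^2 + (1/3) & 0 \le x \le 1 \\ 0 & \text{otherwise,} \end{cases}$$

while

$$f_Y(y) = \begin{cases} \frac{4}{3} y + 2y^5 & 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Let us now compute $P(0.2 \le Y \le 0.3 \mid X = 0.8)$. Using Definitions 2.8.4
and 2.8.3, we have

$$P(0.2 \le Y \le 0.3 \mid X = 0.8) = \int_{0.2}^{0.3} f_{Y|X}(y \mid 0.8)\,dy = \frac{\int_{0.2}^{0.3} f_{X,Y}(0.8, y)\,dy}{f_X(0.8)} = \frac{\int_{0.2}^{0.3} \left(4 (0.8)^2 y + 2y^5\right) dy}{2 (0.8)^2 + \frac{1}{3}}$$

$$= \frac{\frac{4}{2} (0.8)^2 \left((0.3)^2 - (0.2)^2\right) + \frac{2}{6} \left((0.3)^6 - (0.2)^6\right)}{2 (0.8)^2 + \frac{1}{3}} = 0.0398.$$

By contrast, if we compute the unconditioned (i.e., usual) probability that
$0.2 \le Y \le 0.3$, we see that

$$P(0.2 \le Y \le 0.3) = \int_{0.2}^{0.3} f_Y(y)\,dy = \int_{0.2}^{0.3} \left(\frac{4}{3} y + 2y^5\right) dy = \frac{4}{3} \cdot \frac{1}{2} \left((0.3)^2 - (0.2)^2\right) + \frac{2}{6} \left((0.3)^6 - (0.2)^6\right) = 0.0336.$$

We thus see that conditioning on $X = 0.8$ increases the probability that $0.2 \le Y \le 0.3$,
from about 0.0336 to about 0.0398. ■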
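
The arithmetic of Example 2.8.2 can be reproduced in a few lines. The sketch below hard-codes the antiderivatives of the densities from Examples 2.7.6 and 2.7.7; the function names are our own, and the formulas assume 0 ≤ x ≤ 1 and 0 ≤ a ≤ b ≤ 1.

```python
def f_x(x):
    """Marginal density of X from Example 2.7.7, for 0 <= x <= 1."""
    return 2 * x**2 + 1 / 3

def joint_integral_over_y(x, a, b):
    """Integral of 4 x^2 y + 2 y^5 with respect to y over [a, b]."""
    def antiderivative(y):
        return 2 * x**2 * y**2 + y**6 / 3
    return antiderivative(b) - antiderivative(a)

def conditional_prob(x, a, b):
    """P(a <= Y <= b | X = x), using Definitions 2.8.3 and 2.8.4."""
    return joint_integral_over_y(x, a, b) / f_x(x)

def unconditional_prob(a, b):
    """P(a <= Y <= b), integrating f_Y(y) = (4/3) y + 2 y^5 over [a, b]."""
    def antiderivative(y):
        return (2 / 3) * y**2 + y**6 / 3
    return antiderivative(b) - antiderivative(a)

print(round(conditional_prob(0.8, 0.2, 0.3), 4))
print(round(unconditional_prob(0.2, 0.3), 4))
```

Trying other values of x in `conditional_prob` shows how the conditional probability of the same interval for Y changes with the conditioning value.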
By analogy with Theorem 1.3.1, we have the following.


Theorem 2.8.1 (Law of total probability, absolutely continuous random variable
version) Let X and Y be jointly absolutely continuous random variables, and let
$a \le b$ and $c \le d$. Then

$$P(a \le X \le b,\ c \le Y \le d) = \int_c^d \int_a^b f_X(x)\, f_{Y|X}(y \mid x)\,dx\,dy.$$

More generally, if $B \subseteq R^2$ is any region, then

$$P((X, Y) \in B) = \iint_B f_X(x)\, f_{Y|X}(y \mid x)\,dx\,dy.$$


PROOF By Definition 2.8.3,

$$f_X(x)\, f_{Y|X}(y \mid x) = f_{X,Y}(x, y).$$

Hence, the result follows immediately from Definition 2.7.4 and Theorem 2.7.6. ■


2.8.3 | Independence of Random Variables 


Recall from Definition 1.5.2 that two events A and B are independent if $P(A \cap B) =
P(A)\,P(B)$. We wish to have a corresponding definition of independence for random
variables X and Y. Intuitively, independence of X and Y means that X and Y have no
influence on each other, i.e., that the values of X make no change to the probabilities
for Y (and vice versa).

The idea of the formal definition is that X and Y give rise to events, of the form
"$a < X \le b$" or "$Y \in B$," and we want all such events involving X to be independent
of all such events involving Y. Specifically, our definition is the following.


Definition 2.8.5 Let X and Y be two random variables. Then X and Y are
independent if, for all subsets $B_1$ and $B_2$ of the real numbers,

$$P(X \in B_1,\ Y \in B_2) = P(X \in B_1)\,P(Y \in B_2).$$

That is, the events "$X \in B_1$" and "$Y \in B_2$" are independent events.


Intuitively, X and Y are independent if they have no influence on each other, as we 
shall see. 

Now, Definition 2.8.5 is very difficult to work with. Fortunately, there is a much 
simpler characterization of independence. 


Theorem 2.8.2 Let X and Y be two random variables. Then X and Y are
independent if and only if

$$P(a \le X \le b,\ c \le Y \le d) = P(a \le X \le b)\,P(c \le Y \le d) \qquad (2.8.1)$$

whenever $a \le b$ and $c \le d$.


That is, X and Y are independent if and only if the events "$a \le X \le b$" and
"$c \le Y \le d$" are independent events whenever $a \le b$ and $c \le d$.


We shall not prove Theorem 2.8.2 here, although it is similar in spirit to the proof of 
Theorem 2.5.1. However, we shall sometimes use (2.8.1) to check for the independence 
of X and Y. 

Still, even (2.8.1) is not so easy to check directly. For discrete and for absolutely 
continuous distributions, easier conditions are available, as follows. 


Theorem 2.8.3 Let X and Y be two random variables.
(a) If X and Y are discrete, then X and Y are independent if and only if their joint
probability function $p_{X,Y}$ satisfies

$$p_{X,Y}(x, y) = p_X(x)\, p_Y(y)$$

for all $x, y \in R^1$.
(b) If X and Y are jointly absolutely continuous, then X and Y are independent if
and only if their joint density function $f_{X,Y}$ can be chosen to satisfy

$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$$

for all $x, y \in R^1$.




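PROOF (a) If X and Y are independent, then setting $a = b = x$ and $c = d = y$
in (2.8.1), we see that $P(X = x,\ Y = y) = P(X = x)\,P(Y = y)$. Hence,
$p_{X,Y}(x, y) = p_X(x)\, p_Y(y)$.
Conversely, if $p_{X,Y}(x, y) = p_X(x)\, p_Y(y)$ for all $x$ and $y$, then

$$P(a \le X \le b,\ c \le Y \le d) = \sum_{a \le x \le b}\ \sum_{c \le y \le d} p_{X,Y}(x, y) = \sum_{a \le x \le b}\ \sum_{c \le y \le d} p_X(x)\, p_Y(y)$$

$$= \left(\sum_{a \le x \le b} p_X(x)\right) \left(\sum_{c \le y \le d} p_Y(y)\right) = P(a \le X \le b)\,P(c \le Y \le d).$$

This completes the proof of (a).
(b) If $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$ for all $x$ and $y$, then

$$P(a \le X \le b,\ c \le Y \le d) = \int_c^d \int_a^b f_{X,Y}(x, y)\,dx\,dy = \int_c^d \int_a^b f_X(x)\, f_Y(y)\,dx\,dy$$

$$= \left(\int_a^b f_X(x)\,dx\right) \left(\int_c^d f_Y(y)\,dy\right) = P(a \le X \le b)\,P(c \le Y \le d).$$

This completes the proof of the "if" part of (b). The proof of the "only if" part of (b) is
more technical, and we do not include it here. ■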


EXAMPLE 2.8.3 
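Let X and Y have, as in Example 2.7.6, joint density

$$f_{X,Y}(x, y) = \begin{cases} 4x^2 y + 2y^5 & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$

and so, as derived in Example 2.7.7, marginal densities

$$f_X(x) = \begin{cases} 2x^2 + (1/3) & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

and

$$f_Y(y) = \begin{cases} \frac{4}{3} y + 2y^5 & 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Then we compute that

$$f_X(x)\, f_Y(y) = \begin{cases} \left(2x^2 + (1/3)\right) \left(\frac{4}{3} y + 2y^5\right) & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

We therefore see that $f_X(x)\, f_Y(y) \ne f_{X,Y}(x, y)$. Hence, X and Y are not independent. ■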




EXAMPLE 2.8.4 
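Let X and Y have joint density

$$f_{X,Y}(x, y) = \begin{cases} \frac{1}{8080} (12xy^2 + 6x + 4y^2 + 2) & 0 \le x \le 6,\ 3 \le y \le 5 \\ 0 & \text{otherwise.} \end{cases}$$

We compute the marginal densities as

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \begin{cases} \frac{1}{60} + \frac{x}{20} & 0 \le x \le 6 \\ 0 & \text{otherwise} \end{cases}$$

and

$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx = \begin{cases} \frac{3}{202} + \frac{3}{101} y^2 & 3 \le y \le 5 \\ 0 & \text{otherwise.} \end{cases}$$

Then we compute that

$$f_X(x)\, f_Y(y) = \begin{cases} \left(\frac{1}{60} + \frac{x}{20}\right) \left(\frac{3}{202} + \frac{3}{101} y^2\right) & 0 \le x \le 6,\ 3 \le y \le 5 \\ 0 & \text{otherwise.} \end{cases}$$

Multiplying this out, we see that $f_X(x)\, f_Y(y) = f_{X,Y}(x, y)$. Hence, X and Y are
independent in this case. ■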
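
The criterion of Theorem 2.8.3(b) can be spot-checked with exact rational arithmetic. The Python sketch below compares the joint density with the product of the marginals at a few points, for Example 2.8.3 and for Example 2.8.4; we take the marginals of Example 2.8.4 to be 1/60 + x/20 and 3/202 + (3/101)y², which is what integrating the joint density gives. The function names and test points are our own, and agreement on a finite grid is a sanity check, not a proof.

```python
from fractions import Fraction as Fr

def factorizes(joint, f_x, f_y, points):
    """Return True if joint(x, y) == f_x(x) * f_y(y) at every listed point."""
    return all(joint(x, y) == f_x(x) * f_y(y) for x, y in points)

# Example 2.8.3: densities on the unit square.
def joint_283(x, y):
    return 4 * x**2 * y + 2 * y**5

def fx_283(x):
    return 2 * x**2 + Fr(1, 3)

def fy_283(y):
    return Fr(4, 3) * y + 2 * y**5

# Example 2.8.4: densities on 0 <= x <= 6, 3 <= y <= 5.
def joint_284(x, y):
    return Fr(1, 8080) * (12 * x * y**2 + 6 * x + 4 * y**2 + 2)

def fx_284(x):
    return Fr(1, 60) + Fr(1, 20) * x

def fy_284(y):
    return Fr(3, 202) + Fr(3, 101) * y**2

points_283 = [(Fr(1, 2), Fr(1, 2))]  # a single mismatch rules out independence
points_284 = [(Fr(i), Fr(6 + j, 2)) for i in range(7) for j in range(5)]

print(factorizes(joint_283, fx_283, fy_283, points_283))
print(factorizes(joint_284, fx_284, fy_284, points_284))
```

Using `Fraction` avoids floating-point rounding, so the equality test in `factorizes` is exact at each point.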


Combining Theorem 2.8.3 with Definitions 2.8.2 and 2.8.3, we immediately obtain 
the following result about independence. It says that independence of random vari- 
ables is the same as saying that conditioning on one has no effect on the other, which 
corresponds to an intuitive notion of independence. 


Theorem 2.8.4 Let X and Y be two random variables.
(a) If X and Y are discrete, then X and Y are independent if and only if
$p_{Y|X}(y \mid x) = p_Y(y)$, for every $x, y \in R^1$.
(b) If X and Y are jointly absolutely continuous, then X and Y are independent if
and only if $f_{Y|X}(y \mid x) = f_Y(y)$, for every $x, y \in R^1$.


While Definition 2.8.5 is quite difficult to work with, it does provide the easiest 
way to prove one very important property of independence, as follows. 


Theorem 2.8.5 Let X and Y be independent random variables. Let $f, g : R^1 \to R^1$
be any two functions. Then the random variables $f(X)$ and $g(Y)$ are also
independent.

PROOF Using Definition 2.8.5, we compute that

$$P(f(X) \in B_1,\ g(Y) \in B_2) = P\left(X \in f^{-1}(B_1),\ Y \in g^{-1}(B_2)\right)$$

$$= P\left(X \in f^{-1}(B_1)\right) P\left(Y \in g^{-1}(B_2)\right) = P(f(X) \in B_1)\,P(g(Y) \in B_2).$$

(Here $f^{-1}(B_1) = \{x \in R^1 : f(x) \in B_1\}$ and $g^{-1}(B_2) = \{y \in R^1 : g(y) \in B_2\}$.)
Because this is true for any $B_1$ and $B_2$, we see that $f(X)$ and $g(Y)$ are independent. ■


Suppose now that we have n random variables $X_1, \ldots, X_n$. The random variables are independent if and only if the collection of events $\{a_i \le X_i \le b_i\}$ are independent, whenever $a_i \le b_i$, for all $i = 1, 2, \ldots, n$. Generalizing Theorem 2.8.3, we have the following result.


Theorem 2.8.6 Let $X_1, \ldots, X_n$ be a collection of random variables.
(a) If $X_1, \ldots, X_n$ are discrete, then $X_1, \ldots, X_n$ are independent if and only if their joint probability function $p_{X_1,\ldots,X_n}$ satisfies
\[ p_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = p_{X_1}(x_1) \cdots p_{X_n}(x_n) \]
for all $x_1, \ldots, x_n \in R^1$.
(b) If $X_1, \ldots, X_n$ are jointly absolutely continuous, then $X_1, \ldots, X_n$ are independent if and only if their joint density function $f_{X_1,\ldots,X_n}$ can be chosen to satisfy
\[ f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = f_{X_1}(x_1) \cdots f_{X_n}(x_n) \]
for all $x_1, \ldots, x_n \in R^1$.


A particularly common case in statistics is the following. 


Definition 2.8.6 A collection X1, ..., Xn of random variables is independent and 
identically distributed (or i.i.d.) if the collection is independent and if, furthermore, 


each of the n variables has the same distribution. The i.i.d. sequence X1, ..., Xn is 
also referred to as a sample from the common distribution. 


In particular, if a collection $X_1, \ldots, X_n$ of random variables is i.i.d. and discrete, then each of the probability functions $p_{X_i}$ is the same, so that $p_{X_1}(x) = p_{X_2}(x) = \cdots = p_{X_n}(x) = p(x)$, for all $x \in R^1$. Furthermore, from Theorem 2.8.6(a), it follows that
\[ p_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = p_{X_1}(x_1)\, p_{X_2}(x_2) \cdots p_{X_n}(x_n) = p(x_1)\, p(x_2) \cdots p(x_n) \]
for all $x_1, \ldots, x_n \in R^1$.

Similarly, if a collection $X_1, \ldots, X_n$ of random variables is i.i.d. and jointly absolutely continuous, then each of the density functions $f_{X_i}$ is the same, so that $f_{X_1}(x) = f_{X_2}(x) = \cdots = f_{X_n}(x) = f(x)$, for all $x \in R^1$. Furthermore, from Theorem 2.8.6(b), it follows that
\[ f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = f_{X_1}(x_1)\, f_{X_2}(x_2) \cdots f_{X_n}(x_n) = f(x_1)\, f(x_2) \cdots f(x_n) \]
for all $x_1, \ldots, x_n \in R^1$.
We now consider an important family of discrete distributions that arise via sam- 
pling. 




EXAMPLE 2.8.5 Multinomial Distributions 
Suppose we have a response s that can take three possible values — for convenience, 
labelled 1,2, and 3 — with the probability distribution 


\[ P(s = 1) = \theta_1, \quad P(s = 2) = \theta_2, \quad P(s = 3) = \theta_3, \]

so that each $\theta_i \ge 0$ and $\theta_1 + \theta_2 + \theta_3 = 1$. As a simple example, consider a bowl of chips of which a proportion $\theta_i$ of the chips are labelled i (for i = 1, 2, 3). If we randomly draw a chip from the bowl and observe its label s, then $P(s = i) = \theta_i$. Alternatively, consider a population of students at a university of which a proportion $\theta_1$ live on campus (denoted by s = 1), a proportion $\theta_2$ live off-campus with their parents (denoted by s = 2), and a proportion $\theta_3$ live off-campus independently (denoted by s = 3). If we randomly draw a student from this population and determine s for that student, then $P(s = i) = \theta_i$.
We can also write
\[ P(s = i) = \theta_1^{I_{\{1\}}(i)}\, \theta_2^{I_{\{2\}}(i)}\, \theta_3^{I_{\{3\}}(i)} \]
for $i \in \{1, 2, 3\}$, where $I_{\{j\}}$ is the indicator function for $\{j\}$. Therefore, if $(s_1, \ldots, s_n)$ is a sample from the distribution on {1, 2, 3} given by the $\theta_i$, Theorem 2.8.6(a) implies that the joint probability function for the sample equals
\[ P(s_1 = k_1, \ldots, s_n = k_n) = \prod_{j=1}^{n} \theta_1^{I_{\{1\}}(k_j)}\, \theta_2^{I_{\{2\}}(k_j)}\, \theta_3^{I_{\{3\}}(k_j)} = \theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3} \tag{2.8.2} \]
where $x_i = \sum_{j=1}^{n} I_{\{i\}}(k_j)$ is equal to the number of i's in $(k_1, \ldots, k_n)$.
Now, based on the sample $(s_1, \ldots, s_n)$, define the random variables
\[ X_i = \sum_{j=1}^{n} I_{\{i\}}(s_j) \]
for i = 1, 2, and 3. Clearly, $X_i$ is the number of i's observed in the sample and we always have $X_i \in \{0, 1, \ldots, n\}$ and $X_1 + X_2 + X_3 = n$. We refer to the $X_i$ as the counts formed from the sample.

For $(x_1, x_2, x_3)$ satisfying $x_i \in \{0, 1, \ldots, n\}$ and $x_1 + x_2 + x_3 = n$, (2.8.2) implies that the joint probability function for $(X_1, X_2, X_3)$ is given by
\[ p_{(X_1,X_2,X_3)}(x_1, x_2, x_3) = P(X_1 = x_1, X_2 = x_2, X_3 = x_3) = C(x_1, x_2, x_3)\, \theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3} \]


where $C(x_1, x_2, x_3)$ equals the number of samples $(s_1, \ldots, s_n)$ with $x_1$ of its elements equal to 1, $x_2$ of its elements equal to 2, and $x_3$ of its elements equal to 3. To calculate $C(x_1, x_2, x_3)$, we note that there are $\binom{n}{x_1}$ choices for the places of the 1's in the sample sequence, $\binom{n - x_1}{x_2}$ choices for the places of the 2's in the sequence, and finally $\binom{n - x_1 - x_2}{x_3} = 1$ choices for the places of the 3's in the sequence (recall the multinomial coefficient defined in (1.4.4)). Therefore, the probability function for the counts $(X_1, X_2, X_3)$ is equal to
\[ p_{(X_1,X_2,X_3)}(x_1, x_2, x_3) = \binom{n}{x_1} \binom{n - x_1}{x_2} \binom{n - x_1 - x_2}{x_3} \theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3} = \binom{n}{x_1\ x_2\ x_3} \theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3}. \]
We say that
\[ (X_1, X_2, X_3) \sim \text{Multinomial}(n, \theta_1, \theta_2, \theta_3). \]


Notice that the Multinomial$(n, \theta_1, \theta_2, \theta_3)$ generalizes the Binomial$(n, \theta)$ distribution, as we are now counting the number of response values in three possible categories rather than two. Also, it is immediate that
\[ X_i \sim \text{Binomial}(n, \theta_i) \]
because $X_i$ equals the number of occurrences of i in the n independent response values, and i occurs for an individual response with probability equal to $\theta_i$ (also see Problem 2.8.18).

As a simple example, suppose that we have an urn containing 10 red balls, 20 white 
balls, and 30 black balls. If we randomly draw 10 balls from the urn with replacement, 
what is the probability that we will obtain 3 red, 4 white, and 3 black balls? Because 
we are drawing with replacement, the draws are i.i.d., so the counts are distributed 
Multinomial(10, 10/60, 20/60, 30/60) . The required probability equals 


\[ \binom{10}{3\ 4\ 3} \left(\frac{10}{60}\right)^3 \left(\frac{20}{60}\right)^4 \left(\frac{30}{60}\right)^3 = 3.0007 \times 10^{-2}. \]


Note that if we had drawn without replacement, then the draws would not be i.i.d., the 
counts would thus not follow a multinomial distribution but rather a generalization of 
the hypergeometric distribution, as discussed in Problem 2.3.29. 
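The urn calculation above can be reproduced numerically. The following is a minimal sketch using only the standard library; the helper name `multinomial_pmf` is our own and is not notation from the text.

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability of observing the given counts under Multinomial(n, probs), n = sum(counts)."""
    n = sum(counts)
    # Multinomial coefficient n! / (x_1! x_2! ... x_k!), computed with exact integers.
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)
    p = float(coef)
    for x, theta in zip(counts, probs):
        p *= theta ** x
    return p

# 3 red, 4 white, 3 black in 10 draws with replacement from 10 red, 20 white, 30 black balls.
prob = multinomial_pmf([3, 4, 3], [10 / 60, 20 / 60, 30 / 60])
```

Here the multinomial coefficient equals 4200, and `prob` agrees with the value 3.0007 × 10^{-2} quoted above to the displayed precision.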

Now suppose we have a response s that takes k possible values (for convenience, labelled 1, 2, ..., k) with the probability distribution given by $P(s = i) = \theta_i$. For a sample $(s_1, \ldots, s_n)$, define the counts $X_i = \sum_{j=1}^{n} I_{\{i\}}(s_j)$ for $i = 1, \ldots, k$. Then, arguing as above and recalling the development of (1.4.4), we have
\[ p_{(X_1,\ldots,X_k)}(x_1, \ldots, x_k) = \binom{n}{x_1 \ldots x_k} \theta_1^{x_1} \cdots \theta_k^{x_k} \]
whenever each $x_i \in \{0, \ldots, n\}$ and $x_1 + \cdots + x_k = n$. In this case, we write
\[ (X_1, \ldots, X_k) \sim \text{Multinomial}(n, \theta_1, \ldots, \theta_k). \] ■


2.8.4 | Order Statistics 


Suppose now that $(X_1, \ldots, X_n)$ is a sample. In many applications of statistics, we will have n data values where the assumption that these arise as an i.i.d. sequence makes sense. It is often of interest, then, to order these from smallest to largest to obtain the order statistics
\[ X_{(1)}, \ldots, X_{(n)}. \]
Here, $X_{(i)}$ is equal to the ith smallest value in the sample $X_1, \ldots, X_n$. So, for example, if n = 5 and
\[ X_1 = 2.3,\ X_2 = 4.5,\ X_3 = -1.2,\ X_4 = 2.2,\ X_5 = 4.3 \]
then
\[ X_{(1)} = -1.2,\ X_{(2)} = 2.2,\ X_{(3)} = 2.3,\ X_{(4)} = 4.3,\ X_{(5)} = 4.5. \]


Of considerable interest in many situations are the distributions of the order statis- 
tics. Consider the following examples. 


EXAMPLE 2.8.6 Distribution of the Sample Maximum 
Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. so that $F_{X_1}(x) = F_{X_2}(x) = \cdots = F_{X_n}(x)$. Then the largest-order statistic $X_{(n)} = \max(X_1, X_2, \ldots, X_n)$ is the maximum of these n random variables.

Now $X_{(n)}$ is another random variable. What is its cumulative distribution function? We see that $X_{(n)} \le x$ if and only if $X_i \le x$ for all i. Hence,
\[ \begin{aligned} F_{X_{(n)}}(x) &= P(X_{(n)} \le x) = P(X_1 \le x,\ X_2 \le x,\ \ldots,\ X_n \le x) \\ &= P(X_1 \le x)\, P(X_2 \le x) \cdots P(X_n \le x) = F_{X_1}(x)\, F_{X_2}(x) \cdots F_{X_n}(x) \\ &= \left(F_{X_1}(x)\right)^n. \end{aligned} \]
If $F_{X_1}$ corresponds to an absolutely continuous distribution, then we can differentiate this expression to obtain the density of $X_{(n)}$. ■


EXAMPLE 2.8.7 

As a special case of Example 2.8.6, suppose that $X_1, X_2, \ldots, X_n$ are identically and independently distributed Uniform[0, 1]. From the above, for $0 \le x \le 1$, we have $F_{X_{(n)}}(x) = (F_{X_1}(x))^n = x^n$. It then follows from Corollary 2.5.1 that the density $f_{X_{(n)}}$ of $X_{(n)}$ equals $f_{X_{(n)}}(x) = F'_{X_{(n)}}(x) = n x^{n-1}$ for $0 \le x \le 1$, with (of course) $f_{X_{(n)}}(x) = 0$ for x < 0 and x > 1. Note that, from Problem 2.4.24, we can write $X_{(n)} \sim \text{Beta}(n, 1)$. ■


EXAMPLE 2.8.8 Distribution of the Sample Minimum 
Following Example 2.8.6, we can also obtain the distribution function of the sample minimum, or smallest-order statistic, $X_{(1)} = \min(X_1, X_2, \ldots, X_n)$. We have
\[ \begin{aligned} F_{X_{(1)}}(x) &= P(X_{(1)} \le x) \\ &= 1 - P(X_{(1)} > x) \\ &= 1 - P(X_1 > x,\ X_2 > x,\ \ldots,\ X_n > x) \\ &= 1 - P(X_1 > x)\, P(X_2 > x) \cdots P(X_n > x) \\ &= 1 - (1 - F_{X_1}(x))(1 - F_{X_2}(x)) \cdots (1 - F_{X_n}(x)) \\ &= 1 - (1 - F_{X_1}(x))^n. \end{aligned} \]
Again, if $F_{X_1}$ corresponds to an absolutely continuous distribution, we can differentiate this expression to obtain the density of $X_{(1)}$. ■


EXAMPLE 2.8.9 
Let $X_1, \ldots, X_n$ be i.i.d. Uniform[0, 1]. Hence, for $0 \le x \le 1$,
\[ F_{X_{(1)}}(x) = P(X_{(1)} \le x) = 1 - P(X_{(1)} > x) = 1 - (1 - x)^n. \]
It then follows from Corollary 2.5.1 that the density $f_{X_{(1)}}$ of $X_{(1)}$ satisfies $f_{X_{(1)}}(x) = F'_{X_{(1)}}(x) = n(1 - x)^{n-1}$ for $0 \le x \le 1$, with (of course) $f_{X_{(1)}}(x) = 0$ for x < 0 and x > 1. Note that, from Problem 2.4.24, we can write $X_{(1)} \sim \text{Beta}(1, n)$. ■
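The formulas $F_{X_{(n)}}(x) = x^n$ and $F_{X_{(1)}}(x) = 1 - (1 - x)^n$ for uniform samples can be compared against a simulation. The sketch below is only illustrative: the function names, the seed, the sample size n = 5, and the evaluation point x = 0.7 are arbitrary choices of ours.

```python
import random

def simulate_extremes(n, trials, seed=0):
    """Draw `trials` samples of size n from Uniform[0, 1]; return the minima and maxima."""
    rng = random.Random(seed)
    minima, maxima = [], []
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))  # order statistics of one sample
        minima.append(sample[0])    # X_(1)
        maxima.append(sample[-1])   # X_(n)
    return minima, maxima

def empirical_cdf(values, x):
    """Fraction of simulated values that are <= x."""
    return sum(v <= x for v in values) / len(values)

n, x = 5, 0.7
minima, maxima = simulate_extremes(n, trials=20000)
cdf_max_sim = empirical_cdf(maxima, x)   # compare with x**n
cdf_min_sim = empirical_cdf(minima, x)   # compare with 1 - (1 - x)**n
```

With these settings the theoretical values are 0.7^5 = 0.16807 and 1 - 0.3^5 = 0.99757, and the empirical proportions should be close to them up to simulation error.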


The sample median and sample quartiles are defined in terms of order statistics 
and used in statistical applications. These quantities, and their uses, are discussed in 
Section 5.5. 


Summary of Section 2.8 


• If X and Y are discrete, then the conditional probability function of Y, given X, equals $p_{Y|X}(y\,|\,x) = p_{X,Y}(x, y)/p_X(x)$.

• If X and Y are absolutely continuous, then the conditional density function of Y, given X, equals $f_{Y|X}(y\,|\,x) = f_{X,Y}(x, y)/f_X(x)$.

• X and Y are independent if $P(X \in B_1, Y \in B_2) = P(X \in B_1)\, P(Y \in B_2)$ for all $B_1, B_2 \subseteq R^1$.

• Discrete X and Y are independent if and only if $p_{X,Y}(x, y) = p_X(x)\, p_Y(y)$ for all $x, y \in R^1$ or, equivalently, $p_{Y|X}(y\,|\,x) = p_Y(y)$.

• Absolutely continuous X and Y are independent if and only if $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$ for all $x, y \in R^1$ or, equivalently, $f_{Y|X}(y\,|\,x) = f_Y(y)$.

• A sequence $X_1, X_2, \ldots, X_n$ is i.i.d. if the random variables are independent, and each $X_i$ has the same distribution.


EXERCISES 


2.8.1 Suppose X and Y have joint probability function 


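\[ p_{X,Y}(x, y) = \begin{cases} 1/6 & x = -2,\ y = 3 \\ 1/12 & x = -2,\ y = 5 \\ 1/6 & x = 9,\ y = 3 \\ 1/12 & x = 9,\ y = 5 \\ 1/3 & x = 13,\ y = 3 \\ 1/6 & x = 13,\ y = 5 \\ 0 & \text{otherwise.} \end{cases} \]

(a) Compute $p_X(x)$ for all $x \in R^1$.
(b) Compute $p_Y(y)$ for all $y \in R^1$.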
(c) Determine whether or not X and Y are independent. 




2.8.2 Suppose X and Y have joint probability function 


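\[ p_{X,Y}(x, y) = \begin{cases} 1/16 & x = -2,\ y = 3 \\ 1/4 & x = -2,\ y = 5 \\ 1/2 & x = 9,\ y = 3 \\ 1/16 & x = 9,\ y = 5 \\ 1/16 & x = 13,\ y = 3 \\ 1/16 & x = 13,\ y = 5 \\ 0 & \text{otherwise.} \end{cases} \]

(a) Compute $p_X(x)$ for all $x \in R^1$.
(b) Compute $p_Y(y)$ for all $y \in R^1$.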
(c) Determine whether or not X and Y are independent. 


2.8.3 Suppose X and Y have joint density function 


\[ f_{X,Y}(x, y) = \begin{cases} \frac{12}{49}\,(2 + x + xy + 4y^2) & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases} \]

(a) Compute $f_X(x)$ for all $x \in R^1$.

(b) Compute $f_Y(y)$ for all $y \in R^1$.

(c) Determine whether or not X and Y are independent. 


2.8.4 Suppose X and Y have joint density function 


\[ f_{X,Y}(x, y) = \begin{cases} \frac{2}{5(2+e)}\,(3 + e^x + 3y + 3ye^y + ye^x + ye^{x+y}) & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases} \]

(a) Compute $f_X(x)$ for all $x \in R^1$.
(b) Compute $f_Y(y)$ for all $y \in R^1$.
(c) Determine whether or not X and Y are independent. 


2.8.5 Suppose X and Y have joint probability function 


1/9 x = —4, y = —2 

2/9 =s pS —2 

_ | 3/9 x=9, y=-2 
Px,y%,y)= 2/9 x=9, y= 
1/9 x=9, y= 
0 otherwise. 


(a) Compute P(Y = 4| X = 9). 

(b) Compute P(Y = —2| X = 9). 
(c) Compute P(Y = 0| X = —4). 
(d) Compute P(Y = —2| X = 5). 
(e) Compute P(X = 5| Y = —2). 




2.8.6 Let $X \sim \text{Bernoulli}(\theta)$ and $Y \sim \text{Geometric}(\theta)$, with X and Y independent. Let Z = X + Y. What is the probability function of Z?

2.8.7 For each of the following joint density functions $f_{X,Y}$ (taken from Exercise 2.7.4), compute the conditional density $f_{Y|X}(y\,|\,x)$, and determine whether or not X and Y are independent.
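(a)
\[ f_{X,Y}(x, y) = \begin{cases} 2x^2 y + C y^5 & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases} \]
(b)
\[ f_{X,Y}(x, y) = \begin{cases} C(xy + x^5 y^5) & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases} \]
(c)
\[ f_{X,Y}(x, y) = \begin{cases} C(xy + x^5 y^5) & 0 \le x \le 4,\ 0 \le y \le 10 \\ 0 & \text{otherwise.} \end{cases} \]
(d)
\[ f_{X,Y}(x, y) = \begin{cases} C x^5 y^5 & 0 \le x \le 4,\ 0 \le y \le 10 \\ 0 & \text{otherwise.} \end{cases} \]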


2.8.8 Let X and Y be jointly absolutely continuous random variables. Suppose X ~ 
Exponential(2) and that P(Y > 5| X =x) = e7™. Compute P(Y > 5). 

2.8.9 Give an example of two random variables X and Y, each taking values in the set 
{1,2,3}, such that P(X = 1,Y = 1) = P(X = 1) P(Y = 1), but X and Y are not 
independent. 

2.8.10 Let $X \sim \text{Bernoulli}(\theta)$ and $Y \sim \text{Bernoulli}(\psi)$, where $0 < \theta < 1$ and $0 < \psi < 1$. Suppose P(X = 1, Y = 1) = P(X = 1) P(Y = 1). Prove that X and Y must be independent.


2.8.11 Suppose that X is a constant random variable and that Y is any random variable. 
Prove that X and Y must be independent. 

2.8.12 Suppose $X \sim \text{Bernoulli}(1/3)$ and $Y \sim \text{Poisson}(\lambda)$, with X and Y independent and with $\lambda > 0$. Compute P(X = 1 | Y = 5).

2.8.13 Suppose P(X = x, Y = y) = 1/8 for x = 3,5 and y = 1,2,4,7, otherwise 
P(X =x, Y=y)=0. 

(a) Compute the conditional probability function $p_{Y|X}(y\,|\,x)$ for all $x, y \in R^1$ with $p_X(x) > 0$.

(b) Compute the conditional probability function $p_{X|Y}(x\,|\,y)$ for all $x, y \in R^1$ with $p_Y(y) > 0$.

(c) Are X and Y independent? Why or why not? 

2.8.14 Let X and Y have joint density $f_{X,Y}(x, y) = (x^2 + y)/36$ for $-2 < x < 1$ and $0 < y < 4$, otherwise $f_{X,Y}(x, y) = 0$.

(a) Compute the conditional density $f_{Y|X}(y\,|\,x)$ for all $x, y \in R^1$ with $f_X(x) > 0$.

(b) Compute the conditional density $f_{X|Y}(x\,|\,y)$ for all $x, y \in R^1$ with $f_Y(y) > 0$.

(c) Are X and Y independent? Why or why not? 




2.8.15 Let X and Y have joint density $f_{X,Y}(x, y) = (x^2 + y)/4$ for $0 < x < y < 2$, otherwise $f_{X,Y}(x, y) = 0$. Compute each of the following.

(a) The conditional density $f_{Y|X}(y\,|\,x)$ for all $x, y \in R^1$ with $f_X(x) > 0$

(b) The conditional density $f_{X|Y}(x\,|\,y)$ for all $x, y \in R^1$ with $f_Y(y) > 0$

(c) Are X and Y independent? Why or why not?

2.8.16 Suppose we obtain the following sample of size n = 6: $X_1 = 12$, $X_2 = 8$, $X_3 = X_4 = 9$, $X_5 = 7$, and $X_6 = 11$. Specify the order statistics $X_{(i)}$ for $1 \le i \le 6$.


PROBLEMS 


2.8.17 Let X and Y be jointly absolutely continuous random variables, having joint 
density of the form 


\[ f_{X,Y}(x, y) = \begin{cases} C_1(2x^2 y + C_2 y^5) & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases} \]

Determine values of $C_1$ and $C_2$, such that $f_{X,Y}$ is a valid joint density function, and X and Y are independent.

2.8.18 Let X and Y be discrete random variables. Suppose $p_{X,Y}(x, y) = g(x)\, h(y)$, for some functions g and h. Prove that X and Y are independent. (Hint: Use Theorem 2.8.3(a) and Theorem 2.7.4.)

2.8.19 Let X and Y be jointly absolutely continuous random variables. Suppose $f_{X,Y}(x, y) = g(x)\, h(y)$, for some functions g and h. Prove that X and Y are independent. (Hint: Use Theorem 2.8.3(b) and Theorem 2.7.5.)

2.8.20 Let X and Y be discrete random variables, with P(X = 1) > 0 and P(X = 
2) > 0. Suppose P(Y = 1 |X = 1) = 3/4 and P(Y = 2| X = 2) = 3/4. Prove that 
X and Y cannot be independent. 

2.8.21 Let X and Y have the bivariate normal distribution, as in Example 2.7.9. Prove that X and Y are independent if and only if $\rho = 0$.

2.8.22 Suppose that $(X_1, X_2, X_3) \sim \text{Multinomial}(n, \theta_1, \theta_2, \theta_3)$. Prove, by summing the joint probability function, that $X_1 \sim \text{Binomial}(n, \theta_1)$.

2.8.23 Suppose that $(X_1, X_2, X_3) \sim \text{Multinomial}(n, \theta_1, \theta_2, \theta_3)$. Find the conditional distribution of $X_2$ given that $X_1 = x_1$.

2.8.24 Suppose that $X_1, \ldots, X_n$ is a sample from the Exponential($\lambda$) distribution. Find the densities $f_{X_{(1)}}$ and $f_{X_{(n)}}$.

2.8.25 Suppose that $X_1, \ldots, X_n$ is a sample from a distribution with cdf F. Prove that
\[ F_{X_{(i)}}(x) = \sum_{j=i}^{n} \binom{n}{j} F^j(x)\, (1 - F(x))^{n-j}. \]
(Hint: Note that $X_{(i)} \le x$ if and only if at least i of $X_1, \ldots, X_n$ are less than or equal to x.)

2.8.26 Suppose that X1, ..., X5 is a sample from the Uniform[0, 1] distribution. If we 
define the sample median to be X(3), find the density of the sample median. Can you 
identify this distribution? (Hint: Use Problem 2.8.25.) 




2.8.27 Suppose that $(X, Y) \sim \text{Bivariate Normal}(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$. Prove that Y given X = x is distributed $N(\mu_2 + \rho \sigma_2 (x - \mu_1)/\sigma_1,\ (1 - \rho^2)\sigma_2^2)$. Establish the analogous result for the conditional distribution of X given Y = y. (Hint: Use (2.7.1) for Y given X = x and its analog for X given Y = y.)


CHALLENGES 


2.8.28 Let X and Y be random variables. 

(a) Suppose X and Y are both discrete. Prove that X and Y are independent if and only if P(Y = y | X = x) = P(Y = y) for all x and y such that P(X = x) > 0.

(b) Suppose X and Y are jointly absolutely continuous. Prove that X and Y are independent if and only if $P(a \le Y \le b \mid X = x) = P(a \le Y \le b)$ for all x and y such that $f_X(x) > 0$.


2.9 | Multidimensional Change of Variable 


Let X and Y be random variables with known joint distribution. Suppose that $Z = h_1(X, Y)$ and $W = h_2(X, Y)$, where $h_1, h_2 : R^2 \to R^1$ are two functions. What is the joint distribution of Z and W?

This is similar to the problem considered in Section 2.6, except that we have moved 
from a one-dimensional to a two-dimensional setting. The two-dimensional setting is 
more complicated; however, the results remain essentially the same, as we shall see. 


2.9.1 | The Discrete Case 


If X and Y are discrete random variables, then the distribution of Z and W is essentially 
straightforward. 


Theorem 2.9.1 Let X and Y be discrete random variables, with joint probability function $p_{X,Y}$. Let $Z = h_1(X, Y)$ and $W = h_2(X, Y)$, where $h_1, h_2 : R^2 \to R^1$ are some functions. Then Z and W are also discrete, and their joint probability function $p_{Z,W}$ satisfies
\[ p_{Z,W}(z, w) = \sum_{\substack{x,\, y \\ h_1(x,y) = z,\ h_2(x,y) = w}} p_{X,Y}(x, y). \]
Here, the sum is taken over all pairs (x, y) such that $h_1(x, y) = z$ and $h_2(x, y) = w$.

PROOF We compute that $p_{Z,W}(z, w) = P(Z = z, W = w) = P(h_1(X, Y) = z,\ h_2(X, Y) = w)$. This equals
\[ \sum_{\substack{x,\, y \\ h_1(x,y) = z,\ h_2(x,y) = w}} P(X = x, Y = y) = \sum_{\substack{x,\, y \\ h_1(x,y) = z,\ h_2(x,y) = w}} p_{X,Y}(x, y), \]
as claimed. ■


As a special case, we note the following. 




Corollary 2.9.1 Suppose in the context of Theorem 2.9.1 that the joint function $h = (h_1, h_2) : R^2 \to R^2$ defined by $h(x, y) = (h_1(x, y), h_2(x, y))$ is one-to-one, i.e., if $h_1(x_1, y_1) = h_1(x_2, y_2)$ and $h_2(x_1, y_1) = h_2(x_2, y_2)$, then $x_1 = x_2$ and $y_1 = y_2$. Then
\[ p_{Z,W}(z, w) = p_{X,Y}(h^{-1}(z, w)), \]
where $h^{-1}(z, w)$ is the unique pair (x, y) such that h(x, y) = (z, w).


EXAMPLE 2.9.1
Suppose X and Y have joint probability function
\[ p_{X,Y}(x, y) = \begin{cases} 1/6 & x = 2,\ y = 6 \\ 1/12 & x = -2,\ y = -6 \\ 1/4 & x = -3,\ y = 11 \\ 1/2 & x = 3,\ y = -8 \\ 0 & \text{otherwise.} \end{cases} \]

Let $Z = X + Y$ and $W = Y - X^2$. Then $p_{Z,W}(8, 2) = P(Z = 8, W = 2) = P(X = 2, Y = 6) + P(X = -3, Y = 11) = 1/6 + 1/4 = 5/12$. On the other hand, $p_{Z,W}(-5, -17) = P(Z = -5, W = -17) = P(X = 3, Y = -8) = 1/2$. ■
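Theorem 2.9.1 translates directly into a short computation: push each point (x, y) through (h_1, h_2) and add up the probabilities that land on the same (z, w). The sketch below applies this to Example 2.9.1, taking W = Y - X^2, which is the choice consistent with the values 5/12 and 1/2 computed in the example. The function name `transform_pmf` and the dictionary representation are our own assumptions.

```python
from collections import defaultdict
from fractions import Fraction

# Joint probability function of (X, Y) from Example 2.9.1.
p_xy = {
    (2, 6): Fraction(1, 6),
    (-2, -6): Fraction(1, 12),
    (-3, 11): Fraction(1, 4),
    (3, -8): Fraction(1, 2),
}

def transform_pmf(p_xy, h1, h2):
    """Joint probability function of (Z, W) = (h1(X, Y), h2(X, Y))."""
    p_zw = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for (x, y), p in p_xy.items():
        p_zw[(h1(x, y), h2(x, y))] += p  # sum over all (x, y) mapping to the same (z, w)
    return dict(p_zw)

p_zw = transform_pmf(p_xy, lambda x, y: x + y, lambda x, y: y - x**2)
```

Two different points, (2, 6) and (-3, 11), map to (z, w) = (8, 2), which is why their probabilities are added; this map is therefore not one-to-one, and Corollary 2.9.1 does not apply here.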


2.9.2 | The Continuous Case (Advanced) 


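If X and Y are continuous, and the function $h = (h_1, h_2)$ is one-to-one, then it is again possible to compute a formula for the joint density of Z and W, as the following theorem shows. To state it, recall from multivariable calculus that, if $h = (h_1, h_2) : R^2 \to R^2$ is a differentiable function, then its Jacobian derivative J is defined by
\[ J(x, y) = \det \begin{pmatrix} \dfrac{\partial h_1}{\partial x} & \dfrac{\partial h_2}{\partial x} \\[2mm] \dfrac{\partial h_1}{\partial y} & \dfrac{\partial h_2}{\partial y} \end{pmatrix} = \frac{\partial h_1}{\partial x} \frac{\partial h_2}{\partial y} - \frac{\partial h_2}{\partial x} \frac{\partial h_1}{\partial y}. \]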

Theorem 2.9.2 Let X and Y be jointly absolutely continuous, with joint density function $f_{X,Y}$. Let $Z = h_1(X, Y)$ and $W = h_2(X, Y)$, where $h_1, h_2 : R^2 \to R^1$ are differentiable functions. Define the joint function $h = (h_1, h_2) : R^2 \to R^2$ by
\[ h(x, y) = (h_1(x, y),\ h_2(x, y)). \]
Assume that h is one-to-one, at least on the region $\{(x, y) : f_{X,Y}(x, y) > 0\}$, i.e., if $h_1(x_1, y_1) = h_1(x_2, y_2)$ and $h_2(x_1, y_1) = h_2(x_2, y_2)$, then $x_1 = x_2$ and $y_1 = y_2$. Then Z and W are also jointly absolutely continuous, with joint density function $f_{Z,W}$ given by
\[ f_{Z,W}(z, w) = f_{X,Y}(h^{-1}(z, w)) \,/\, |J(h^{-1}(z, w))|, \]
where J is the Jacobian derivative of h and where $h^{-1}(z, w)$ is the unique pair (x, y) such that h(x, y) = (z, w).




PROOF | See Section 2.11 for the proof of this result. E 


EXAMPLE 2.9.2 
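Let X and Y be jointly absolutely continuous, with joint density function $f_{X,Y}$ given by
\[ f_{X,Y}(x, y) = \begin{cases} 4x^2 y + 2y^5 & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise,} \end{cases} \]
as in Example 2.7.6. Let $Z = X + Y^2$ and $W = X - Y^2$. What is the joint density of Z and W?

We first note that $Z = h_1(X, Y)$ and $W = h_2(X, Y)$, where $h_1(x, y) = x + y^2$ and $h_2(x, y) = x - y^2$. Hence,
\[ J(x, y) = \frac{\partial h_1}{\partial x} \frac{\partial h_2}{\partial y} - \frac{\partial h_2}{\partial x} \frac{\partial h_1}{\partial y} = (1)(-2y) - (1)(2y) = -4y. \]

We may invert the relationship by solving for X and Y, to obtain that
\[ X = \frac{1}{2}(Z + W) \quad \text{and} \quad Y = \sqrt{\frac{Z - W}{2}}. \]

This means that $h = (h_1, h_2)$ is invertible, with
\[ h^{-1}(z, w) = \left( \frac{1}{2}(z + w),\ \sqrt{\frac{z - w}{2}} \right). \]

Hence, using Theorem 2.9.2, we see that
\[ \begin{aligned} f_{Z,W}(z, w) &= f_{X,Y}(h^{-1}(z, w)) \,/\, |J(h^{-1}(z, w))| \\ &= f_{X,Y}\left( \tfrac{1}{2}(z + w),\ \sqrt{\tfrac{z - w}{2}} \right) \Big/ \left| J\left( \tfrac{1}{2}(z + w),\ \sqrt{\tfrac{z - w}{2}} \right) \right| \\ &= \begin{cases} \dfrac{4\left(\tfrac{1}{2}(z + w)\right)^2 \sqrt{\tfrac{z - w}{2}} + 2\left(\tfrac{z - w}{2}\right)^{5/2}}{4\sqrt{\tfrac{z - w}{2}}} & 0 \le \tfrac{1}{2}(z + w) \le 1,\ 0 \le \sqrt{\tfrac{z - w}{2}} \le 1 \\ 0 & \text{otherwise} \end{cases} \\ &= \begin{cases} \left(\dfrac{z + w}{2}\right)^2 + \dfrac{1}{2}\left(\dfrac{z - w}{2}\right)^2 & 0 \le z + w \le 2,\ 0 \le z - w \le 2 \\ 0 & \text{otherwise.} \end{cases} \end{aligned} \]

We have thus obtained the joint density function for Z and W. ■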


EXAMPLE 2.9.3 
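Let $U_1$ and $U_2$ be independent, each having the Uniform[0, 1] distribution. (We could write this as $U_1, U_2$ are i.i.d. Uniform[0, 1].) Thus,
\[ f_{U_1,U_2}(u_1, u_2) = \begin{cases} 1 & 0 \le u_1 \le 1,\ 0 \le u_2 \le 1 \\ 0 & \text{otherwise.} \end{cases} \]

Then define X and Y by
\[ X = \sqrt{2 \log(1/U_1)}\, \cos(2\pi U_2), \qquad Y = \sqrt{2 \log(1/U_1)}\, \sin(2\pi U_2). \]

What is the joint density of X and Y?
We see that here $X = h_1(U_1, U_2)$ and $Y = h_2(U_1, U_2)$, where
\[ h_1(u_1, u_2) = \sqrt{2 \log(1/u_1)}\, \cos(2\pi u_2), \qquad h_2(u_1, u_2) = \sqrt{2 \log(1/u_1)}\, \sin(2\pi u_2). \]

Therefore,
\[ \frac{\partial h_1}{\partial u_1}(u_1, u_2) = \frac{1}{2} \left(2 \log(1/u_1)\right)^{-1/2} \left(2u_1(-1/u_1^2)\right) \cos(2\pi u_2). \]

Continuing in this way, we eventually compute (see Exercise 2.9.1) that
\[ J(u_1, u_2) = \frac{\partial h_1}{\partial u_1} \frac{\partial h_2}{\partial u_2} - \frac{\partial h_2}{\partial u_1} \frac{\partial h_1}{\partial u_2} = -\frac{2\pi}{u_1} \left( \cos^2(2\pi u_2) + \sin^2(2\pi u_2) \right) = -\frac{2\pi}{u_1}. \]

On the other hand, inverting the relationship h, we compute that
\[ U_1 = e^{-(X^2 + Y^2)/2}, \qquad U_2 = \arctan(Y/X) / 2\pi. \]
Hence, using Theorem 2.9.2, we see that
\[ \begin{aligned} f_{X,Y}(x, y) &= f_{U_1,U_2}(h^{-1}(x, y)) \,/\, |J(h^{-1}(x, y))| \\ &= f_{U_1,U_2}\left( e^{-(x^2+y^2)/2},\ \arctan(y/x)/2\pi \right) \Big/ \left| J\left( e^{-(x^2+y^2)/2},\ \arctan(y/x)/2\pi \right) \right| \\ &= \begin{cases} 1 \big/ \left| -2\pi / e^{-(x^2+y^2)/2} \right| & 0 \le e^{-(x^2+y^2)/2} \le 1,\ 0 \le \arctan(y/x)/2\pi \le 1 \\ 0 & \text{otherwise} \end{cases} \\ &= \frac{1}{2\pi}\, e^{-(x^2+y^2)/2} \end{aligned} \]
where the last expression is valid for all x and y, because we always have $0 \le e^{-(x^2+y^2)/2} \le 1$ and $0 \le \arctan(y/x)/2\pi \le 1$.
We conclude that
\[ f_{X,Y}(x, y) = \left( \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \right) \left( \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2} \right). \]

We recognize this as a product of two standard normal densities. We thus conclude that $X \sim N(0, 1)$ and $Y \sim N(0, 1)$ and that, furthermore, X and Y are independent. ■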
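The transformation in Example 2.9.3 gives a practical recipe for turning two uniform values into two independent standard normal values. The sketch below is a hedged illustration using Python's standard generator; the function name `box_muller`, the seed, and the sample size are our own choices, and replacing U_1 by 1 - U_1 (which has the same distribution) is a small adjustment of ours to avoid evaluating log(1/0).

```python
import math
import random

def box_muller(rng):
    """Map two Uniform[0, 1] draws to a pair (X, Y) as in Example 2.9.3."""
    u1 = 1.0 - rng.random()   # lies in (0, 1], so log(1/u1) is always defined
    u2 = rng.random()
    r = math.sqrt(2.0 * math.log(1.0 / u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

rng = random.Random(1)
pairs = [box_muller(rng) for _ in range(50000)]
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]

# Sample summaries: these should be near 0, 1, and 0 if X, Y are independent N(0, 1).
mean_x = sum(xs) / len(xs)
var_x = sum((x - mean_x) ** 2 for x in xs) / len(xs)
cov_xy = sum(x * y for x, y in pairs) / len(pairs) - mean_x * (sum(ys) / len(ys))
```

A sample covariance near zero is consistent with independence but does not prove it; the independence itself comes from the factorization of the joint density derived above.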




2.9.3 | Convolution 


Suppose now that X and Y are independent, with known distributions, and that Z = 
X + Y. What is the distribution of Z? In this case, the distribution of Z is called the 
convolution of the distributions of X and of Y. Fortunately, the convolution is often 
reasonably straightforward to compute. 


Theorem 2.9.3 Let X and Y be independent, and let Z = X + Y.
(a) If X and Y are both discrete, with probability functions $p_X$ and $p_Y$, then Z is also discrete, with probability function $p_Z$ given by
\[ p_Z(z) = \sum_{w} p_X(z - w)\, p_Y(w). \]
(b) If X and Y are jointly absolutely continuous, with density functions $f_X$ and $f_Y$, then Z is also absolutely continuous, with density function $f_Z$ given by
\[ f_Z(z) = \int_{-\infty}^{\infty} f_X(z - w)\, f_Y(w)\, dw. \]

PROOF (a) We let W = Y and consider the two-dimensional transformation from (X, Y) to (Z, W) = (X + Y, Y).

In the discrete case, by Corollary 2.9.1, $p_{Z,W}(z, w) = p_{X,Y}(z - w, w)$. Then from Theorem 2.7.4, $p_Z(z) = \sum_w p_{Z,W}(z, w) = \sum_w p_{X,Y}(z - w, w)$. But because X and Y are independent, $p_{X,Y}(x, y) = p_X(x)\, p_Y(y)$, so $p_{X,Y}(z - w, w) = p_X(z - w)\, p_Y(w)$. This proves part (a).

(b) In the continuous case, we must compute the Jacobian derivative J(x, y) of the transformation from (X, Y) to (Z, W) = (X + Y, Y). Fortunately, this is very easy, as we obtain
\[ J(x, y) = \frac{\partial (x + y)}{\partial x} \frac{\partial y}{\partial y} - \frac{\partial y}{\partial x} \frac{\partial (x + y)}{\partial y} = (1)(1) - (0)(1) = 1. \]

Hence, from Theorem 2.9.2, $f_{Z,W}(z, w) = f_{X,Y}(z - w, w)/|1| = f_{X,Y}(z - w, w)$ and from Theorem 2.7.5,
\[ f_Z(z) = \int_{-\infty}^{\infty} f_{Z,W}(z, w)\, dw = \int_{-\infty}^{\infty} f_{X,Y}(z - w, w)\, dw. \]

But because X and Y are independent, we may take $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$, so $f_{X,Y}(z - w, w) = f_X(z - w)\, f_Y(w)$. This proves part (b). ■

EXAMPLE 2.9.4 

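Let $X \sim \text{Binomial}(4, 1/5)$ and $Y \sim \text{Bernoulli}(1/4)$, with X and Y independent. Let Z = X + Y. Then
\[ \begin{aligned} p_Z(3) &= P(X + Y = 3) = P(X = 3, Y = 0) + P(X = 2, Y = 1) \\ &= \binom{4}{3} (1/5)^3 (4/5)^1 (3/4) + \binom{4}{2} (1/5)^2 (4/5)^2 (1/4) \\ &= 4(1/5)^3 (4/5)^1 (3/4) + 6(1/5)^2 (4/5)^2 (1/4) = 0.0576. \end{aligned} \] ■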
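The discrete convolution formula of Theorem 2.9.3(a) can be sketched in a few lines, with probability functions stored as dictionaries. The helper names `binomial_pmf` and `convolve` are illustrative choices of ours, and exact fractions are used so the result can be compared with the hand computation in Example 2.9.4.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, theta):
    """Probability function of Binomial(n, theta) as a dict {k: P(X = k)}."""
    return {k: comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(n + 1)}

def convolve(p_x, p_y):
    """Probability function of Z = X + Y for independent discrete X and Y."""
    p_z = {}
    for w, pw in p_y.items():
        for x, px in p_x.items():
            # The pair (x, w) contributes p_X(z - w) p_Y(w) to the value z = x + w.
            p_z[x + w] = p_z.get(x + w, 0) + px * pw
    return p_z

p_x = binomial_pmf(4, Fraction(1, 5))
p_y = {0: Fraction(3, 4), 1: Fraction(1, 4)}   # Bernoulli(1/4)
p_z = convolve(p_x, p_y)
```

Here `p_z[3]` equals 36/625 = 0.0576, matching Example 2.9.4, and the values of `p_z` sum to 1.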




EXAMPLE 2.9.5 
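Let $X \sim \text{Uniform}[3, 7]$ and $Y \sim \text{Exponential}(6)$, with X and Y independent. Let Z = X + Y. Then
\[ \begin{aligned} f_Z(5) &= \int_{-\infty}^{\infty} f_X(x)\, f_Y(5 - x)\, dx = \int_3^5 (1/4)\, 6\, e^{-6(5 - x)}\, dx \\ &= (1/4)\, e^{-6(5 - x)} \Big|_{x=3}^{x=5} = -(1/4)\, e^{-12} + (1/4)\, e^{0} = 0.2499985. \end{aligned} \]
Note that here the limits of integration go from 3 to 5 only, because $f_X(x) = 0$ for x < 3, while $f_Y(5 - x) = 0$ for x > 5. ■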
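The value of the convolution integral at z = 5 can be checked numerically. The sketch below uses a simple midpoint rule; the function names and the number of steps are arbitrary choices of ours, and Exponential(6) is taken to have density 6e^{-6y} for y ≥ 0, as in the integrand of the example.

```python
import math

def f_x(x):
    """Density of Uniform[3, 7]."""
    return 0.25 if 3.0 <= x <= 7.0 else 0.0

def f_y(y):
    """Density of Exponential(6), i.e., 6 exp(-6y) for y >= 0."""
    return 6.0 * math.exp(-6.0 * y) if y >= 0.0 else 0.0

def convolution_density(z, lo, hi, steps=100000):
    """Midpoint-rule approximation of the integral of f_x(x) f_y(z - x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += f_x(x) * f_y(z - x)
    return h * total

approx = convolution_density(5.0, 3.0, 5.0)     # integrand vanishes outside [3, 5]
exact = 0.25 * (1.0 - math.exp(-12.0))          # closed form from the example
```

Both `approx` and `exact` agree with the value 0.2499985 reported in the example to the displayed precision.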


Summary of Section 2.9 


• If X and Y are discrete, and $Z = h_1(X, Y)$ and $W = h_2(X, Y)$, then
\[ p_{Z,W}(z, w) = \sum_{\{(x, y)\,:\ h_1(x, y) = z,\ h_2(x, y) = w\}} p_{X,Y}(x, y). \]

• If X and Y are absolutely continuous, if $Z = h_1(X, Y)$ and $W = h_2(X, Y)$, and if $h = (h_1, h_2) : R^2 \to R^2$ is one-to-one with Jacobian J(x, y), then $f_{Z,W}(z, w) = f_{X,Y}(h^{-1}(z, w)) / |J(h^{-1}(z, w))|$.

• This allows us to compute the joint distribution of functions of pairs of random variables.


EXERCISES 


2.9.1 Verify explicitly in Example 2.9.3 that $J(u_1, u_2) = -2\pi/u_1$.

2.9.2 Let $X \sim \text{Exponential}(3)$ and $Y \sim \text{Uniform}[1, 4]$, with X and Y independent. Let Z = X + Y and W = X - Y.

(a) Write down the joint density $f_{X,Y}(x, y)$ of X and Y. (Be sure to consider the ranges of valid x and y values.)

(b) Find a two-dimensional function h such that (Z, W) = h(X, Y).

(c) Find a two-dimensional function $h^{-1}$ such that $(X, Y) = h^{-1}(Z, W)$.

(d) Compute the joint density $f_{Z,W}(z, w)$ of Z and W. (Again, be sure to consider the ranges of valid z and w values.)

2.9.3 Repeat parts (b) through (d) of Exercise 2.9.2, for the same random variables X and Y, if instead $Z = X^2 + Y^2$ and $W = X^2 - Y^2$.

2.9.4 Repeat parts (b) through (d) of Exercise 2.9.2, for the same random variables X and Y, if instead Z = X + 4 and W = Y - 3.

2.9.5 Repeat parts (b) through (d) of Exercise 2.9.2, for the same random variables X and Y, if instead $Z = Y^4$ and $W = X^4$.




2.9.6 Suppose the joint probability function of X and Y is given by 


\[ p_{X,Y}(x, y) = \begin{cases} 1/7 & x = 5,\ y = 0 \\ 1/7 & x = 5,\ y = 3 \\ 1/7 & x = 5,\ y = 4 \\ 3/7 & x = 8,\ y = 0 \\ 1/7 & x = 8,\ y = 4 \\ 0 & \text{otherwise.} \end{cases} \]


Let $Z = X + Y$, $W = X - Y$, $A = X^2 + Y^2$, and $B = 2X - 3Y^2$.
(a) Compute the joint probability function $p_{Z,W}(z, w)$.

(b) Compute the joint probability function $p_{A,B}(a, b)$.

(c) Compute the joint probability function $p_{Z,A}(z, a)$.

(d) Compute the joint probability function $p_{W,B}(w, b)$.

2.9.7 Let X have probability function 


1/3 x =0 
_ J] 1⁄2 V= 2 
Px) = 1/6 x =3 
0 otherwise, 
and let Y have probability function 
1/6 y=2 
-Jan ys 
0 otherwise. 


Suppose X and Y are independent. Let Z = X + Y. Compute pz(z) for all z € R!. 
2.9.8 Let X ~ Geometric(1/4), and let Y have probability function 


1/6 y=2 
_ J]J 1/12 y=5 
0 otherwise. 


Let W = X + Y. Suppose X and Y are independent. Compute pw (w) for all w € R!. 
2.9.9 Suppose X and Y are discrete, with P(X = 1, Y = 1) = P(X = 1, Y = 2) = P(X = 1, Y = 3) = P(X = 2, Y = 2) = P(X = 2, Y = 3) = 1/5, otherwise P(X = x, Y = y) = 0. Let $Z = X - Y^2$ and $W = X^2 + 5Y$.

(a) Compute the joint probability function $p_{Z,W}(z, w)$ for all $z, w \in R^1$.

(b) Compute the marginal probability function $p_Z(z)$ for Z.

(c) Compute the marginal probability function $p_W(w)$ for W.

2.9.10 Suppose X has density $f_X(x) = x^3/4$ for 0 < x < 2, otherwise $f_X(x) = 0$, and Y has density $f_Y(y) = 5y^4/32$ for 0 < y < 2, otherwise $f_Y(y) = 0$. Assume X and Y are independent, and let Z = X + Y.

(a) Compute the joint density $f_{X,Y}(x, y)$ for all $x, y \in R^1$.
(b) Compute the density $f_Z(z)$ for Z.


PROBLEMS 


2.9.11 Suppose again that X has density $f_X(x) = x^3/4$ for 0 < x < 2, otherwise $f_X(x) = 0$, that Y has density $f_Y(y) = 5y^4/32$ for 0 < y < 2, otherwise $f_Y(y) = 0$, and that X and Y are independent. Let Z = X - Y and W = 4X + 3Y.

(a) Compute the joint density $f_{Z,W}(z, w)$ for all $z, w \in R^1$.

(b) Compute the marginal density $f_Z(z)$ for Z.

(c) Compute the marginal density $f_W(w)$ for W.

2.9.12 Let $X \sim \text{Binomial}(n_1, \theta)$ independent of $Y \sim \text{Binomial}(n_2, \theta)$. Let Z = X + Y. Use Theorem 2.9.3(a) to prove that $Z \sim \text{Binomial}(n_1 + n_2, \theta)$.

2.9.13 Let X and Y be independent, with $X \sim \text{Negative-Binomial}(r_1, \theta)$ and $Y \sim \text{Negative-Binomial}(r_2, \theta)$. Let Z = X + Y. Use Theorem 2.9.3(a) to prove that $Z \sim \text{Negative-Binomial}(r_1 + r_2, \theta)$.

2.9.14 Let X and Y be independent, with $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$. Let Z = X + Y. Use Theorem 2.9.3(b) to prove that $Z \sim N(\mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2)$.

2.9.15 Let X and Y be independent, with $X \sim \text{Gamma}(\alpha_1, \lambda)$ and $Y \sim \text{Gamma}(\alpha_2, \lambda)$. Let Z = X + Y. Use Theorem 2.9.3(b) to prove that $Z \sim \text{Gamma}(\alpha_1 + \alpha_2, \lambda)$.

2.9.16 (MV) Show that when $Z_1, Z_2$ are i.i.d. N(0, 1) and X, Y are given by (2.7.1), then $(X, Y) \sim \text{Bivariate Normal}(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$.


2.10 | Simulating Probability Distributions 


So far, we have been concerned primarily with mathematical theory and manipulations 
of probabilities and random variables. However, modern high-speed computers can 
be used to simulate probabilities and random variables numerically. Such simulations 
have many applications, including: 


• To approximate quantities that are too difficult to compute mathematically

• To graphically simulate complicated physical or biological systems

• To randomly sample from large data sets to search for errors or illegal activities, etc.

• To implement complicated algorithms to sharpen pictures, recognize speech, etc.

• To simulate intelligent behavior

• To encrypt data or generate passwords

• To solve puzzles or break codes by trying lots of random solutions

• To generate random choices for online quizzes, computer games, etc.




Indeed, as computers become faster and more widespread, probabilistic simulations are 
becoming more and more common in software applications, scientific research, quality 
control, marketing, law enforcement, etc. 

In most applications of probabilistic simulation, the first step is to simulate random variables having certain distributions. That is, a certain probability distribution will be specified, and we want to generate one or more random variables having that distribution.

Now, nearly all modern computer languages come with a pseudorandom number generator, which is a device for generating a sequence $U_1, U_2, \ldots$ of random values that are approximately independent and have approximately the uniform distribution on [0, 1]. Now, in fact, the $U_i$ are usually generated from some sort of deterministic iterative procedure, which is designed to "appear" random. So the $U_i$ are, in fact, not random, but rather pseudorandom.

Nevertheless, we shall ignore any concerns about pseudorandomness and shall sim- 
ply assume that 


U1, U2, U3, ... ~ Uniform[0, 1], (2.10.1) 


i.e., the $U_i$ are i.i.d. Uniform[0, 1].

Hence, if all we ever need are Uniform[0, 1] random variables, then according 
to (2.10.1), we are all set. However, in most applications, other kinds of randomness 
are also required. We therefore consider how to use the uniform random variables 
of (2.10.1) to generate random variables having other distributions. 


EXAMPLE 2.10.1 The Uniform[L, R] Distribution 
Suppose we want to generate X ~ Uniform[L, R]. According to Exercise 2.6.1, we 
can simply set 

X=(R-L)U+L, 


to ensure that X ~ Uniform[L, R]. E 


2.10.1 | Simulating Discrete Distributions 
We now consider the question of how to simulate from discrete distributions. 
EXAMPLE 2.10.2 The Bernoulli($\theta$) Distribution

Suppose we want to generate $X \sim \text{Bernoulli}(\theta)$, where $0 < \theta < 1$. We can simply set
\[ X = \begin{cases} 1 & U_1 \le \theta \\ 0 & U_1 > \theta. \end{cases} \]

Then clearly, we always have either X = 0 or X = 1. Furthermore, $P(X = 1) = P(U_1 \le \theta) = \theta$, because $U_1 \sim \text{Uniform}[0, 1]$. Hence, we see that $X \sim \text{Bernoulli}(\theta)$. ■


EXAMPLE 2.10.3 The Binomial(n, θ) Distribution
Suppose we want to generate Y ~ Binomial(n, θ), where 0 < θ < 1 and n ≥ 1. There
are two natural methods for doing this.

First, we can simply define Y as follows:

$$ Y = \min\Big\{ j : \sum_{k=0}^{j} \binom{n}{k} \theta^k (1-\theta)^{n-k} \ge U_1 \Big\}. $$

That is, we let Y be the largest value of j such that the sum of the binomial probabilities
up to j − 1 is still less than U_1. In that case,

$$ \begin{aligned}
P(Y = y) &= P\Big( \sum_{k=0}^{y-1} \binom{n}{k} \theta^k (1-\theta)^{n-k} < U_1
   \ \text{and}\ \sum_{k=0}^{y} \binom{n}{k} \theta^k (1-\theta)^{n-k} \ge U_1 \Big) \\
&= P\Big( \sum_{k=0}^{y-1} \binom{n}{k} \theta^k (1-\theta)^{n-k} < U_1
   \le \sum_{k=0}^{y} \binom{n}{k} \theta^k (1-\theta)^{n-k} \Big) \\
&= \sum_{k=0}^{y} \binom{n}{k} \theta^k (1-\theta)^{n-k}
   - \sum_{k=0}^{y-1} \binom{n}{k} \theta^k (1-\theta)^{n-k} \\
&= \binom{n}{y} \theta^y (1-\theta)^{n-y}.
\end{aligned} $$

Hence, we have Y ~ Binomial(n, θ), as desired.

Alternatively, we can set

$$ X_i = \begin{cases} 1 & U_i \le \theta \\ 0 & U_i > \theta \end{cases} $$

for i = 1, 2, 3, .... Then, by Example 2.10.2, we have X_i ~ Bernoulli(θ) for each i,
with the {X_i} independent because the {U_i} are independent. Hence, by the observation
at the end of Example 2.3.3, if we set Y = X_1 + ··· + X_n, then we will again have
Y ~ Binomial(n, θ). ∎
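The two methods of this example can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the final `return n` is an added guard against floating-point round-off rather than part of the method itself.

```python
import math
import random

def binomial_by_inversion(n, theta, u):
    """First method: smallest j whose cumulative binomial probability is >= u."""
    cumulative = 0.0
    for j in range(n + 1):
        cumulative += math.comb(n, j) * theta**j * (1 - theta)**(n - j)
        if cumulative >= u:
            return j
    return n  # guard: round-off may leave the total slightly below u near 1

def binomial_by_bernoulli_sum(n, theta, uniforms):
    """Second method: count how many of the first n uniforms are <= theta."""
    return sum(1 for u in uniforms[:n] if u <= theta)

rng = random.Random(1)  # fixed seed only so the sketch is reproducible
print(binomial_by_inversion(12, 1 / 3, rng.random()))
print(binomial_by_bernoulli_sum(12, 1 / 3, [rng.random() for _ in range(12)]))
```

The first function needs one uniform but evaluates binomial coefficients; the second needs n uniforms but only comparisons, which matches the trade-off discussed in the text.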


In Example 2.10.3, the second method is more elegant and is also simpler compu- 
tationally (as it does not require computing any binomial coefficients). On the other 
hand, the first method of Example 2.10.3 is more general, as the following theorem 
shows. 


Theorem 2.10.1 Let p be a probability function for a discrete probability distribution.
Let x_1 < x_2 < x_3 < ··· be all the values for which p(x_i) > 0. Let
U_1 ~ Uniform[0, 1]. Define Y by

$$ Y = \min\Big\{ x_j : \sum_{k=1}^{j} p(x_k) \ge U_1 \Big\}. $$

Then Y is a discrete random variable, having probability function p.

PROOF  We have

$$ \begin{aligned}
P(Y = x_i) &= P\Big( \sum_{k=1}^{i-1} p(x_k) < U_1, \ \text{and}\ \sum_{k=1}^{i} p(x_k) \ge U_1 \Big) \\
&= P\Big( \sum_{k=1}^{i-1} p(x_k) < U_1 \le \sum_{k=1}^{i} p(x_k) \Big) \\
&= \sum_{k=1}^{i} p(x_k) - \sum_{k=1}^{i-1} p(x_k) = p(x_i).
\end{aligned} $$

Also, clearly P(Y = y) = 0 if y ∉ {x_1, x_2, ...}. Hence, for all y ∈ R^1, we have
P(Y = y) = p(y), as desired. ∎
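A minimal sketch of the construction in Theorem 2.10.1, assuming a finite list of values sorted in increasing order. The helper name `make_discrete_sampler` is hypothetical, and the index clamp is an added safeguard against floating-point round-off.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def make_discrete_sampler(values, probs):
    """Return u -> min{x_j : p(x_1) + ... + p(x_j) >= u}.

    Assumes values are sorted increasingly and probs are positive and sum to 1.
    """
    cumulative = list(accumulate(probs))
    def sample(u):
        j = bisect_left(cumulative, u)           # first index with cumulative[j] >= u
        return values[min(j, len(values) - 1)]   # clamp in case of round-off
    return sample

sample = make_discrete_sampler([0, 10, 20], [0.2, 0.5, 0.3])
rng = random.Random(2)  # fixed seed only so the sketch is reproducible
draws = [sample(rng.random()) for _ in range(10_000)]
print(draws.count(10) / len(draws))  # relative frequency of the value 10
```

Using a binary search over the cumulative sums is a design choice for speed; a plain left-to-right scan implements the same minimum.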


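EXAMPLE 2.10.4 The Geometric(θ) Distribution
To simulate Y ~ Geometric(θ), we again have two choices. Using Theorem 2.10.1,
we can let U_1 ~ Uniform[0, 1] and then set

$$ \begin{aligned}
Y &= \min\Big\{ j : \sum_{k=0}^{j} \theta (1-\theta)^k \ge U_1 \Big\}
   = \min\big\{ j : 1 - (1-\theta)^{j+1} \ge U_1 \big\} \\
&= \min\Big\{ j : j \ge \frac{\log(1 - U_1)}{\log(1-\theta)} - 1 \Big\}
   = \Big\lfloor \frac{\log(1 - U_1)}{\log(1-\theta)} \Big\rfloor,
\end{aligned} $$

where ⌊r⌋ means to round down r to the next integer value, i.e., ⌊r⌋ is the greatest
integer not exceeding r (sometimes called the floor of r). The last equality holds
whenever the ratio is not an integer, which happens with probability 1.

Alternatively, using the definition of Geometric(θ) from Example 2.3.4, we can set

$$ X_i = \begin{cases} 1 & U_i \le \theta \\ 0 & U_i > \theta \end{cases} $$

for i = 1, 2, 3, ... (where U_i ~ Uniform[0, 1]), and then let Y = min{i : X_i = 1} − 1,
the number of failures before the first success. Either way, we have Y ~ Geometric(θ),
as desired. ∎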
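Both approaches can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes the convention of Example 2.3.4, in which a Geometric(θ) variable counts the failures before the first success and so takes the values 0, 1, 2, ....

```python
import math
import random

def geometric_closed_form(theta, u):
    """Inversion method: floor(log(1 - u) / log(1 - theta)) for u in [0, 1)."""
    return math.floor(math.log(1.0 - u) / math.log(1.0 - theta))

def geometric_by_trials(theta, uniforms):
    """Count failures (u > theta) before the first success (u <= theta)."""
    failures = 0
    for u in uniforms:
        if u <= theta:
            return failures
        failures += 1
    raise ValueError("ran out of uniforms before the first success")

rng = random.Random(3)  # fixed seed only so the sketch is reproducible
print(geometric_closed_form(0.2, rng.random()))
print(geometric_by_trials(0.2, (rng.random() for _ in range(10_000))))
```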


2.10.2 | Simulating Continuous Distributions 


We next turn to the subject of simulating absolutely continuous distributions. In gen- 
eral, this is not an easy problem. However, for certain particular continuous distribu- 
tions, it is not difficult, as we now demonstrate. 


EXAMPLE 2.10.5 The Uniform[L, R] Distribution
We have already seen in Example 2.10.1 that if U_1 ~ Uniform[0, 1], and we set

$$ X = (R - L)\,U_1 + L, $$

then X ~ Uniform[L, R]. Thus, simulating from any uniform distribution is
straightforward. ∎


EXAMPLE 2.10.6 The Exponential(λ) Distribution
We have also seen, in Example 2.6.6, that if U_1 ~ Uniform[0, 1], and we set

$$ Y = \ln(1/U_1), $$

then Y ~ Exponential(1). Thus, simulating from the Exponential(1) distribution is
straightforward.

Furthermore, we know from Exercise 2.6.4 that once Y ~ Exponential(1), then if
λ > 0, and we set

$$ Z = Y/\lambda = \ln(1/U_1)/\lambda, $$

then Z ~ Exponential(λ). Thus, simulating from any Exponential(λ) distribution is
also straightforward. ∎
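A brief Python sketch of this recipe follows. The function name is hypothetical; the sketch uses `1.0 - rng.random()` so that the uniform value lies in (0, 1] and the logarithm is always defined, which is an implementation detail not discussed in the text.

```python
import math
import random

def exponential(lam, u):
    """Z = ln(1/U) / lambda, for u in (0, 1] and lam > 0."""
    return math.log(1.0 / u) / lam

rng = random.Random(3)  # fixed seed only so the sketch is reproducible
draws = [exponential(2.0, 1.0 - rng.random()) for _ in range(20_000)]
print(sum(draws) / len(draws))  # should be close to 1/lambda = 0.5
```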


EXAMPLE 2.10.7 The N(μ, σ²) Distribution
Simulating from the standard normal distribution, N(0, 1), may appear to be more
difficult. However, by Example 2.9.3, if U_1 ~ Uniform[0, 1] and U_2 ~ Uniform[0, 1],
with U_1 and U_2 independent, and we set

$$ X = \sqrt{2 \log(1/U_1)}\,\cos(2\pi U_2), \qquad
   Y = \sqrt{2 \log(1/U_1)}\,\sin(2\pi U_2), \tag{2.10.2} $$

then X ~ N(0, 1) and Y ~ N(0, 1) (and furthermore, X and Y are independent). So,
using this trick, the standard normal distribution can be easily simulated as well.

It then follows from Exercise 2.6.3 that, once we have X ~ N(0, 1), if we set
Z = σX + μ, then Z ~ N(μ, σ²). Hence, it is straightforward to sample from any
normal distribution. ∎
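The transformation (2.10.2), followed by the shift and scale Z = σX + μ, might be coded as below. The function names `box_muller` and `normal` are illustrative choices (the text does not name the method), and the first uniform is taken in (0, 1] so that the logarithm is defined.

```python
import math
import random

def box_muller(u1, u2):
    """Return (X, Y) from (2.10.2); u1 must lie in (0, 1], u2 in [0, 1]."""
    r = math.sqrt(2.0 * math.log(1.0 / u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

def normal(mu, sigma, u1, u2):
    """Z = sigma * X + mu, where X is the first coordinate from box_muller."""
    x, _ = box_muller(u1, u2)
    return sigma * x + mu

rng = random.Random(4)  # fixed seed only so the sketch is reproducible
xs = [box_muller(1.0 - rng.random(), rng.random())[0] for _ in range(20_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(mean, var)  # should be close to 0 and 1
```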


These examples illustrate that, for certain special continuous distributions, sam- 
pling from them is straightforward. To provide a general method of sampling from a 
continuous distribution, we first state the following definition. 


Definition 2.10.1 Let X be a random variable, with cumulative distribution function F.
Then the inverse cdf (or quantile function) of X is the function F^{-1} defined by

$$ F^{-1}(t) = \min\{x : F(x) \ge t\}, $$

for 0 < t < 1.


In Figure 2.10.1, we have provided a plot of the inverse cdf of an N(0, 1) distribution.
Note that this function goes to −∞ as the argument goes to 0, and goes to ∞ as
the argument goes to 1.

Figure 2.10.1: The inverse cdf of the N(0, 1) distribution.

Using the inverse cdf, we obtain a general method of sampling from a continuous
distribution, as follows.


Theorem 2.10.2 (Inversion method for generating random variables) Let F be any
cumulative distribution function, and let U ~ Uniform[0, 1]. Define a random
variable Y by Y = F^{-1}(U). Then P(Y ≤ y) = F(y), i.e., Y has cumulative
distribution function given by F.

PROOF  We begin by noting that P(Y ≤ y) = P(F^{-1}(U) ≤ y). But F^{-1}(U) is the
smallest value x such that F(x) ≥ U. Hence, F^{-1}(U) ≤ y if and only if F(y) ≥ U,
i.e., U ≤ F(y). Therefore,

$$ P(Y \le y) = P(F^{-1}(U) \le y) = P(U \le F(y)). $$

But 0 ≤ F(y) ≤ 1, and U ~ Uniform[0, 1], so P(U ≤ F(y)) = F(y). Thus,

$$ P(Y \le y) = P(U \le F(y)) = F(y). $$

It follows that F is the cdf of Y, as claimed. ∎


We note that Theorem 2.10.2 is valid for any cumulative distribution function, whether 
it corresponds to a continuous distribution, a discrete distribution, or a mixture of the 
two (as in Section 2.5.4). In fact, this was proved for discrete distributions in Theorem 
2.10.1. 
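When no closed form for F^{-1} is at hand, the definition F^{-1}(t) = min{x : F(x) ≥ t} can be approximated numerically. The sketch below uses bisection. The function name, the default bracket [−10^6, 10^6], and the fixed iteration count are all assumptions made for illustration, and the bracket must satisfy F(lo) < t ≤ F(hi).

```python
import math

def inverse_cdf(cdf, t, lo=-1e6, hi=1e6, iterations=200):
    """Approximate min{x : cdf(x) >= t} by bisection.

    Assumes cdf is nondecreasing and right-continuous, with cdf(lo) < t <= cdf(hi).
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if cdf(mid) >= t:
            hi = mid   # mid already satisfies cdf(mid) >= t, so the minimum is <= mid
        else:
            lo = mid   # the minimum lies strictly above mid
    return hi

def exp_cdf(x):
    """Cdf of the Exponential(1) distribution."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def step_cdf(x):
    """Cdf of a discrete distribution with P(X = 1) = 0.4 and P(X = 3) = 0.6."""
    return 0.0 if x < 1 else (0.4 if x < 3 else 1.0)

print(inverse_cdf(exp_cdf, 0.5), math.log(2.0))  # the two values should agree closely
print(inverse_cdf(step_cdf, 0.5))                # close to 3
```

The same routine handles continuous and discrete cdfs, in line with the remark that Theorem 2.10.2 is valid for any cumulative distribution function.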


EXAMPLE 2.10.8 Generating from an Exponential Distribution
Let F be the cdf of an Exponential(1) random variable. Then, for x ≥ 0,

$$ F(x) = \int_0^x e^{-t}\,dt = 1 - e^{-x}. $$

It then follows that

$$ \begin{aligned}
F^{-1}(t) &= \min\{x : F(x) \ge t\} = \min\{x : 1 - e^{-x} \ge t\} \\
&= \min\{x : x \ge -\ln(1 - t)\} = -\ln(1 - t) = \ln(1/(1 - t)).
\end{aligned} $$

Therefore, by Theorem 2.10.2, if U ~ Uniform[0, 1], and we set

$$ Y = F^{-1}(U) = \ln(1/(1 - U)), \tag{2.10.3} $$

then Y ~ Exponential(1).

Now, we have already seen from Example 2.6.6 that, if U ~ Uniform[0, 1], and we
set Y = ln(1/U), then Y ~ Exponential(1). This is essentially the same as (2.10.3),
except that we have replaced U by 1 − U. On the other hand, this is not surprising,
because we already know by Exercise 2.6.2 that, if U ~ Uniform[0, 1], then also
1 − U ~ Uniform[0, 1]. ∎


EXAMPLE 2.10.9 Generating from the Standard Normal Distribution
Let Φ be the cdf of a N(0, 1) random variable, as in Definition 2.5.2. Then

$$ \Phi^{-1}(t) = \min\{x : \Phi(x) \ge t\}, $$

and there is no simpler formula for Φ^{-1}(t). By Theorem 2.10.2, if
U ~ Uniform[0, 1], and we set

$$ Y = \Phi^{-1}(U), \tag{2.10.4} $$

then Y ~ N(0, 1).

On the other hand, due to the difficulties of computing with Φ and Φ^{-1}, the
method of (2.10.4) is not very practical. It is far better to use the method of (2.10.2)
to simulate a normal random variable. ∎


For distributions that are too complicated to sample using the inversion method of 
Theorem 2.10.2, and for which no simple trick is available, it may still be possible to 
do sampling using Markov chain methods, which we will discuss in later chapters, or 
by rejection sampling (see Challenge 2.10.21). 
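To give a flavor of rejection sampling, here is a minimal Python sketch in its usual form: propose Y from an easy density g, draw an independent U ~ Uniform[0, 1], and accept Y when f(Y) ≥ U c g(Y). The function names, the choice of U on [0, 1], and the toy target f(x) = 2x on [0, 1] are assumptions made for illustration, so the details may differ from the statement of Challenge 2.10.21.

```python
import random

def rejection_sample(f, g_sampler, g_density, c, rng):
    """Draw one value with density f, assuming f(x) <= c * g_density(x) for all x."""
    while True:
        y = g_sampler(rng)                 # proposal Y with density g
        u = rng.random()                   # U ~ Uniform[0, 1], independent of Y
        if f(y) >= u * c * g_density(y):   # accept with probability f(y) / (c g(y))
            return y

def target_density(x):
    """Toy target: f(x) = 2x on [0, 1], which satisfies f <= 2 * (uniform density)."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

rng = random.Random(5)  # fixed seed only so the sketch is reproducible
draws = [
    rejection_sample(target_density, lambda r: r.random(), lambda x: 1.0, 2.0, rng)
    for _ in range(20_000)
]
print(sum(draws) / len(draws))  # the target has mean 2/3
```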


Summary of Section 2.10 


• It is important to be able to simulate probability distributions.

• If X is discrete, taking the value x_i with probability p_i, where x_1 < x_2 < ···,
and U ~ Uniform[0, 1], and Y = min{x_j : p_1 + ··· + p_j ≥ U}, then Y has the same
distribution as X. This method can be used to simulate virtually any discrete
distribution.

• If F is any cumulative distribution function with inverse cdf F^{-1}, U ~ Uniform[0, 1],
and Y = F^{-1}(U), then Y has cumulative distribution function F. This allows
us to simulate virtually any continuous distribution.

• There are simple methods of simulating many standard distributions, including
the binomial, uniform, exponential, and normal.


EXERCISES 


2.10.1 Let Y be a discrete random variable with P(Y = −7) = 1/2, P(Y = −2) =
1/3, and P(Y = 5) = 1/6. Find a formula for Z in terms of U, such that if U ~
Uniform[0, 1], then Z has the same distribution as Y.

2.10.2 For each of the following cumulative distribution functions F, find a formula 
for X in terms of U, such that if U ~ Uniform[0, 1], then X has cumulative distribution 
function F. 


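(a)
$$ F(x) = \begin{cases} 0 & x < 0 \\ x & 0 \le x \le 1 \\ 1 & x > 1 \end{cases} $$
(b)
$$ F(x) = \begin{cases} 0 & x < 0 \\ x^2 & 0 \le x \le 1 \\ 1 & x > 1 \end{cases} $$
(c)
$$ F(x) = \begin{cases} 0 & x < 0 \\ x^2/9 & 0 \le x \le 3 \\ 1 & x > 3 \end{cases} $$
(d)
$$ F(x) = \begin{cases} 0 & x < 1 \\ x^2/9 & 1 \le x \le 3 \\ 1 & x > 3 \end{cases} $$
(e)
$$ F(x) = \begin{cases} 0 & x < 0 \\ x^5/32 & 0 \le x \le 2 \\ 1 & x > 2 \end{cases} $$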
(f) 
0 x <0 
_ 1/3 O0<x <7 
POEST A T<x<ll 
1 x>I1l1 


2.10.3 Suppose U ~ Uniform[0, 1], and Y = ln(1/U)/3. What is the distribution of
Y?

2.10.4 Generalizing the previous question, suppose U ~ Uniform[0, 1] and W =
ln(1/U)/λ for some fixed λ > 0.

(a) What is the distribution of W?

(b) Does this provide a way of simulating from a certain well-known distribution?
Explain.


2.10.5 Let U_1 ~ Uniform[0, 1] and U_2 ~ Uniform[0, 1] be independent, and let X =
c_1 √(log(1/U_1)) cos(2πU_2) + c_2. Find values of c_1 and c_2 such that X ~ N(5, 9).

2.10.6 Let U ~ Uniform[0, 1]. Find a formula for Y in terms of U, such that P(Y =
3) = P(Y = 4) = 2/5 and P(Y = 7) = 1/5, otherwise P(Y = y) = 0.

2.10.7 Suppose P(X = 1) = 1/3, P(X = 2) = 1/6, P(X = 4) = 1/2, and
P(X = x) = 0 otherwise.

(a) Compute the cdf F_X(x) for all x ∈ R^1.

(b) Compute the inverse cdf F_X^{-1}(t) for all t ∈ R^1.

(c) Let U ~ Uniform[0, 1]. Find a formula for Y in terms of U, such that Y has cdf
F_X.

2.10.8 Let X have density function f_X(x) = 3√x/2 for 0 < x < 1, otherwise
f_X(x) = 0.

(a) Compute the cdf F_X(x) for all x ∈ R^1.

(b) Compute the inverse cdf F_X^{-1}(t) for all t ∈ R^1.

(c) Let U ~ Uniform[0, 1]. Find a formula for Y in terms of U, such that Y has density
f_X.

2.10.9 Let U ~ Uniform[0, 1]. Find a formula for Z in terms of U, such that Z has
density f_Z(z) = 4z^3 for 0 < z < 1, otherwise f_Z(z) = 0.


COMPUTER EXERCISES 


2.10.10 For each of the following distributions, use the computer (you can use any
algorithms available to you as part of a software package) to simulate X_1, X_2, ..., X_N
i.i.d. having the given distribution. (Take N = 1000 at least, with N = 10,000 or N =
100,000 if possible.) Then compute

$$ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i \qquad \text{and} \qquad \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})^2. $$

(a) Uniform[0, 1]

(b) Uniform[5, 8]

(c) Bernoulli(1/3)

(d) Binomial(12, 1/3)

(e) Geometric(1/5)

(f) Exponential(1)

(g) Exponential(13)

(h) N(0, 1)

(i) NG, 9) 


PROBLEMS 


2.10.11 Let G(x) = p_1 F_1(x) + p_2 F_2(x) + ··· + p_k F_k(x), where p_i ≥ 0, Σ_i p_i =
1, and F_i are cdfs, as in (2.5.3). Suppose we can generate X_i to have cdf F_i, for
i = 1, 2, ..., k. Describe a procedure for generating a random variable Y that has cdf
G.

2.10.12 Let X be an absolutely continuous random variable, with density given by
f_X(x) = x^{-2} for x ≥ 1, with f_X(x) = 0 otherwise. Find a formula for Z in terms of
U, such that if U ~ Uniform[0, 1], then Z has the same distribution as X.

2.10.13 Find the inverse cdf of the logistic distribution of Problem 2.4.18. (Hint: See 
Problem 2.5.20.) 




2.10.14 Find the inverse cdf of the Weibull(a) distribution of Problem 2.4.19. (Hint: 
See Problem 2.5.21.) 

2.10.15 Find the inverse cdf of the Pareto(a) distribution of Problem 2.4.20. (Hint: 
See Problem 2.5.22.) 

2.10.16 Find the inverse cdf of the Cauchy distribution of Problem 2.4.21. (Hint: See 
Problem 2.5.23.) 

2.10.17 Find the inverse cdf of the Laplace distribution of Problem 2.4.22. (Hint: See 
Problem 2.5.24.) 

2.10.18 Find the inverse cdf of the extreme value distribution of Problem 2.4.23. (Hint: 
See Problem 2.5.25.) 

2.10.19 Find the inverse cdfs of the beta distributions in Problem 2.4.24(b) through 
(d). (Hint: See Problem 2.5.26.) 

2.10.20 (Method of composition) If we generate X ~ f_X obtaining x, and then generate
Y from f_{Y|X}(· | x), prove that Y ~ f_Y.


CHALLENGES 


2.10.21 (Rejection sampling) Suppose f is a complicated density function. Suppose g
is a density function from which it is easy to sample (e.g., the density of a uniform or
exponential or normal distribution). Suppose we know a value of c such that f(x) ≤
cg(x) for all x ∈ R^1. The following provides a method, called rejection sampling, for
sampling from a complicated density f by using a simpler density g, provided only
that we know f(x) ≤ cg(x) for all x ∈ R^1.

(a) Suppose Y has density g. Let U ~ Uniform[0, c], with U and Y independent.
Prove that

$$ P(a \le Y \le b \mid f(Y) \ge U c g(Y)) = \int_a^b f(x)\,dx. $$

(Hint: Use Theorem 2.8.1 to show that
P(a ≤ Y ≤ b, f(Y) ≥ cUg(Y)) = ∫_a^b g(y) P(f(Y) ≥ cUg(Y) | Y = y) dy.)
(b) Suppose that Y1, Y2, .. . are i.i.d., each with density g, and independently U1, U2,... 
are i.i.d. Uniform[0, c]. Let io = 0, and forn > 1, letin = min{j > in—1 : U; f (Yj) = 
cg(Y;)}. Prove that X;,, Xi,,... are i.i.d., each with density f. (Hint: Prove this for 
Xi Xi ) 


29° 
1? 


2.11 | Further Proofs (Advanced) 


Proof of Theorem 2.4.2 


We want to prove that the function φ given by (2.4.9) is a density function.

Clearly φ(x) ≥ 0 for all x. To proceed, we set I = ∫_{−∞}^{∞} φ(x) dx. Then, using
multivariable calculus,

$$ \begin{aligned}
I^2 &= \Big( \int_{-\infty}^{\infty} \phi(x)\,dx \Big) \Big( \int_{-\infty}^{\infty} \phi(y)\,dy \Big) \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \phi(x)\,\phi(y)\,dx\,dy
 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{1}{2\pi}\,e^{-(x^2 + y^2)/2}\,dx\,dy.
\end{aligned} $$

We now switch to polar coordinates (r, θ), so that x = r cos θ and y = r sin θ,
where r > 0 and 0 ≤ θ ≤ 2π. Then x² + y² = r² and, by the multivariable change of
variable theorem from calculus, dx dy = r dr dθ. Hence,

$$ \begin{aligned}
I^2 &= \int_0^{2\pi} \int_0^{\infty} \frac{1}{2\pi}\,e^{-r^2/2}\,r\,dr\,d\theta
    = \int_0^{\infty} e^{-r^2/2}\,r\,dr \\
&= -e^{-r^2/2}\,\Big|_{r=0}^{r=\infty} = (-0) - (-1) = 1,
\end{aligned} $$

and we have I² = 1. But clearly I ≥ 0 (because φ ≥ 0), so we must have I = 1, as
claimed. ∎
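As a numerical sanity check on this result (not a substitute for the proof), the integral of φ can be approximated with a simple midpoint rule. The sketch assumes φ is the standard normal density φ(x) = e^{−x²/2}/√(2π). The helper names, the truncation to [−10, 10], and the number of subintervals are arbitrary illustrative choices.

```python
import math

def phi(x):
    """Standard normal density (assumed form of the function in (2.4.9))."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def midpoint_integral(func, a, b, n):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

approx = midpoint_integral(phi, -10.0, 10.0, 100_000)
print(approx)  # should be very close to 1
```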


Proof of Theorem 2.6.2 


We want to prove that, when X is an absolutely continuous random variable, with density
function f_X and Y = h(X), where h : R^1 → R^1 is a function that is differentiable
and strictly increasing, then Y is also absolutely continuous, and its density function
f_Y is given by

$$ f_Y(y) = f_X(h^{-1}(y)) \,/\, |h'(h^{-1}(y))|, \tag{2.11.1} $$

where h' is the derivative of h, and where h^{-1}(y) is the unique number x such that
h(x) = y.

We must show that whenever a ≤ b, we have

$$ P(a \le Y \le b) = \int_a^b f_Y(y)\,dy, $$

where f_Y is given by (2.11.1). To that end, we note that, because h is strictly increasing,
so is h^{-1}. Hence, applying h^{-1} preserves inequalities, so that

$$ \begin{aligned}
P(a \le Y \le b) &= P(h^{-1}(a) \le h^{-1}(Y) \le h^{-1}(b)) = P(h^{-1}(a) \le X \le h^{-1}(b)) \\
&= \int_{h^{-1}(a)}^{h^{-1}(b)} f_X(x)\,dx.
\end{aligned} $$

We then make the substitution y = h(x), so that x = h^{-1}(y), and

$$ dx = \Big( \frac{d}{dy} h^{-1}(y) \Big)\,dy. $$

But by the inverse function theorem from calculus, (d/dy) h^{-1}(y) = 1/h'(h^{-1}(y)).
Furthermore, as x goes from h^{-1}(a) to h^{-1}(b), we see that y = h(x) goes from a to b.
We conclude that

$$ \begin{aligned}
P(a \le Y \le b) &= \int_{h^{-1}(a)}^{h^{-1}(b)} f_X(x)\,dx
 = \int_a^b f_X(h^{-1}(y))\,\frac{1}{|h'(h^{-1}(y))|}\,dy \\
&= \int_a^b f_Y(y)\,dy,
\end{aligned} $$

as required. ∎


Proof of Theorem 2.6.3 


We want to prove that when X is an absolutely continuous random variable, with density
function f_X and Y = h(X), where h : R^1 → R^1 is a function that is differentiable
and strictly decreasing, then Y is also absolutely continuous, and its density function
f_Y may again be defined by (2.11.1).

We note that, because h is strictly decreasing, so is h^{-1}. Hence, applying h^{-1}
reverses the inequalities, so that

$$ \begin{aligned}
P(a \le Y \le b) &= P(h^{-1}(b) \le h^{-1}(Y) \le h^{-1}(a)) = P(h^{-1}(b) \le X \le h^{-1}(a)) \\
&= \int_{h^{-1}(b)}^{h^{-1}(a)} f_X(x)\,dx.
\end{aligned} $$

We then make the substitution y = h(x), so that x = h^{-1}(y), and

$$ dx = \Big( \frac{d}{dy} h^{-1}(y) \Big)\,dy. $$

But by the inverse function theorem from calculus,

$$ \frac{d}{dy} h^{-1}(y) = \frac{1}{h'(h^{-1}(y))}. $$

Furthermore, as x goes from h^{-1}(b) to h^{-1}(a), we see that y = h(x) goes from b to a.
Because h' < 0, reversing the limits of integration absorbs the sign, and we conclude that

$$ \begin{aligned}
P(a \le Y \le b) &= \int_{h^{-1}(b)}^{h^{-1}(a)} f_X(x)\,dx
 = \int_b^a f_X(h^{-1}(y))\,\frac{1}{h'(h^{-1}(y))}\,dy \\
&= \int_a^b f_X(h^{-1}(y))\,\frac{1}{|h'(h^{-1}(y))|}\,dy = \int_a^b f_Y(y)\,dy,
\end{aligned} $$

as required. ∎


Proof of Theorem 2.9.2 


We want to prove the following result. Let X and Y be jointly absolutely continuous,
with joint density function f_{X,Y}. Let Z = h_1(X, Y) and W = h_2(X, Y), where h_1, h_2 :
R^2 → R^1 are differentiable functions. Define the joint function h = (h_1, h_2) : R^2 →
R^2 by

$$ h(x, y) = (h_1(x, y), h_2(x, y)). $$

Assume that h is one-to-one, at least on the region {(x, y) : f(x, y) > 0}, i.e., if
h_1(x_1, y_1) = h_1(x_2, y_2) and h_2(x_1, y_1) = h_2(x_2, y_2), then x_1 = x_2 and y_1 = y_2. Then
Z and W are also jointly absolutely continuous, with joint density function f_{Z,W} given
by

$$ f_{Z,W}(z, w) = f_{X,Y}(h^{-1}(z, w)) \,/\, |J(h^{-1}(z, w))|, $$

where J is the Jacobian derivative of h, and where h^{-1}(z, w) is the unique pair (x, y)
such that h(x, y) = (z, w).

We must show that whenever a ≤ b and c ≤ d, we have

$$ P(a \le Z \le b,\ c \le W \le d) = \int_c^d \int_a^b f_{Z,W}(z, w)\,dz\,dw. $$

If we let S = [a, b] × [c, d] be the two-dimensional rectangle, then we can rewrite this
as

$$ P((Z, W) \in S) = \iint_S f_{Z,W}(z, w)\,dz\,dw. $$

Now, using the theory of multivariable calculus, and making the substitution (x, y) =
h^{-1}(z, w) (which is permissible because h is one-to-one), we have

$$ \begin{aligned}
\iint_S f_{Z,W}(z, w)\,dz\,dw
&= \iint_S \big( f_{X,Y}(h^{-1}(z, w)) \,/\, |J(h^{-1}(z, w))| \big)\,dz\,dw \\
&= \iint_{h^{-1}(S)} \big( f_{X,Y}(x, y) \,/\, |J(x, y)| \big)\,|J(x, y)|\,dx\,dy \\
&= \iint_{h^{-1}(S)} f_{X,Y}(x, y)\,dx\,dy = P((X, Y) \in h^{-1}(S)) \\
&= P(h^{-1}(Z, W) \in h^{-1}(S)) = P((Z, W) \in S),
\end{aligned} $$

as required. ∎




Chapter 3 
Expectation 


CHAPTER OUTLINE 


Section 1 The Discrete Case 

Section 2 The Absolutely Continuous Case 
Section 3 Variance, Covariance, and Correlation 
Section 4 Generating Functions 

Section 5 Conditional Expectation 

Section 6 Inequalities 

Section 7 General Expectations (Advanced) 
Section 8 Further Proofs (Advanced) 


In the first two chapters we learned about probability models, random variables, and 
distributions. There is one more concept that is fundamental to all of probability theory, 
that of expected value. 

Intuitively, the expected value of a random variable is the average value that the 
random variable takes on. For example, if half the time X = 0, and the other half of 
the time X = 10, then the average value of X is 5. We shall write this as E(X) = 5. 
Similarly, if one-third of the time Y = 6 while two-thirds of the time Y = 15, then 
E(Y) = 12. 

Another interpretation of expected value is in terms of fair gambling. Suppose 
someone offers you a ticket (e.g., a lottery ticket) worth a certain random amount X. 
How much would you be willing to pay to buy the ticket? It seems reasonable that you 
would be willing to pay the expected value E(X) of the ticket, but no more. However, 
this interpretation does have certain limitations; see Example 3.1.12. 

To understand expected value more precisely, we consider discrete and absolutely 
continuous random variables separately. 


3.1 | The Discrete Case


We begin with a definition. 




Definition 3.1.1 Let X be a discrete random variable. Then the expected value (or
mean value or mean) of X, written E(X) (or μ_X), is defined by

$$ E(X) = \sum_{x \in R^1} x\,P(X = x) = \sum_{x \in R^1} x\,p_X(x). $$

We will have P(X = x) = 0 except for those values x that are possible values of X.
Hence, an equivalent definition is the following.

Definition 3.1.2 Let X be a discrete random variable, taking on distinct values
x_1, x_2, ..., with p_i = P(X = x_i). Then the expected value of X is given by

$$ E(X) = \sum_i x_i\,p_i. $$


The definition (in either form) is best understood through examples. 


EXAMPLE 3.1.1
Suppose, as above, that P(X = 0) = P(X = 10) = 1/2. Then

$$ E(X) = (0)(1/2) + (10)(1/2) = 5, $$

as predicted. ∎

EXAMPLE 3.1.2
Suppose, as above, that P(Y = 6) = 1/3, and P(Y = 15) = 2/3. Then

$$ E(Y) = (6)(1/3) + (15)(2/3) = 2 + 10 = 12, $$

again as predicted. ∎

EXAMPLE 3.1.3
Suppose that P(Z = −3) = 0.2, and P(Z = 11) = 0.7, and P(Z = 31) = 0.1. Then

$$ E(Z) = (-3)(0.2) + (11)(0.7) + (31)(0.1) = -0.6 + 7.7 + 3.1 = 10.2. $$

∎

EXAMPLE 3.1.4
Suppose that P(W = −3) = 0.2, and P(W = −11) = 0.7, and P(W = 31) = 0.1.
Then

$$ E(W) = (-3)(0.2) + (-11)(0.7) + (31)(0.1) = -0.6 - 7.7 + 3.1 = -5.2. $$

In this case, the expected value of W is negative. ∎
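The four computations above follow one pattern, which the short Python sketch below makes explicit. The function name `expected_value` and the dictionary representation of a probability function are illustrative assumptions; exact fractions are used so that no rounding enters.

```python
from fractions import Fraction

def expected_value(pmf):
    """E(X) = sum of x * P(X = x), where pmf maps each value x to P(X = x)."""
    return sum(x * p for x, p in pmf.items())

e_x = expected_value({0: Fraction(1, 2), 10: Fraction(1, 2)})      # Example 3.1.1
e_y = expected_value({6: Fraction(1, 3), 15: Fraction(2, 3)})      # Example 3.1.2
e_z = expected_value({-3: Fraction(2, 10), 11: Fraction(7, 10),
                      31: Fraction(1, 10)})                        # Example 3.1.3
e_w = expected_value({-3: Fraction(2, 10), -11: Fraction(7, 10),
                      31: Fraction(1, 10)})                        # Example 3.1.4
print(e_x, e_y, float(e_z), float(e_w))
```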


We thus see that, for a discrete random variable X, once we know the probabilities that
X = x (or equivalently, once we know the probability function p_X), it is straightforward
(at least in simple cases) to compute the expected value of X.

We now consider some of the common discrete distributions introduced in Section 2.3.


EXAMPLE 3.1.5 Degenerate Distributions
If X ≡ c is a constant, then P(X = c) = 1, so

$$ E(X) = (c)(1) = c, $$

as it should. ∎

EXAMPLE 3.1.6 The Bernoulli(θ) Distribution and Indicator Functions
If X ~ Bernoulli(θ), then P(X = 1) = θ and P(X = 0) = 1 − θ, so

$$ E(X) = (1)(\theta) + (0)(1 - \theta) = \theta. $$

As a particular application of this, suppose we have a response s taking values in a
sample space S and A ⊂ S. Letting X(s) = I_A(s), we have that X is the indicator function
of the set A and so takes the values 0 and 1. Then we have that P(X = 1) = P(A),
and so X ~ Bernoulli(P(A)). This implies that

$$ E(X) = E(I_A) = P(A). $$

Therefore, we have shown that the expectation of the indicator function of the set A is
equal to the probability of A. ∎


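EXAMPLE 3.1.7 The Binomial(n, θ) Distribution
If Y ~ Binomial(n, θ), then

$$ P(Y = k) = \binom{n}{k} \theta^k (1-\theta)^{n-k} $$

for k = 0, 1, ..., n. Hence,

$$ \begin{aligned}
E(Y) &= \sum_{k=0}^{n} k\,P(Y = k) = \sum_{k=0}^{n} k \binom{n}{k} \theta^k (1-\theta)^{n-k} \\
&= \sum_{k=0}^{n} k\,\frac{n!}{k!\,(n-k)!}\,\theta^k (1-\theta)^{n-k}
 = \sum_{k=1}^{n} \frac{n!}{(k-1)!\,(n-k)!}\,\theta^k (1-\theta)^{n-k} \\
&= \sum_{k=1}^{n} \frac{n\,(n-1)!}{(k-1)!\,(n-k)!}\,\theta^k (1-\theta)^{n-k}
 = \sum_{k=1}^{n} n \binom{n-1}{k-1} \theta^k (1-\theta)^{n-k}.
\end{aligned} $$

Now, the binomial theorem says that for any a and b and any positive integer m,

$$ (a + b)^m = \sum_{j=0}^{m} \binom{m}{j} a^j b^{m-j}. $$

Using this, and setting j = k − 1, we see that

$$ \begin{aligned}
E(Y) &= \sum_{k=1}^{n} n \binom{n-1}{k-1} \theta^k (1-\theta)^{n-k}
 = \sum_{j=0}^{n-1} n \binom{n-1}{j} \theta^{j+1} (1-\theta)^{n-j-1} \\
&= n\theta \sum_{j=0}^{n-1} \binom{n-1}{j} \theta^j (1-\theta)^{n-j-1}
 = n\theta\,(\theta + 1 - \theta)^{n-1} = n\theta.
\end{aligned} $$

Hence, the expected value of Y is nθ. Note that this is precisely n times the expected
value of X, where X ~ Bernoulli(θ) as in Example 3.1.6. We shall see in
Example 3.1.15 that this is not a coincidence. ∎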
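The identity E(Y) = nθ can be checked exactly for particular n and θ by summing k P(Y = k) directly. The sketch below is only a spot check with a hypothetical function name; it uses exact rational arithmetic so that the comparison is not affected by rounding.

```python
from fractions import Fraction
from math import comb

def binomial_mean(n, theta):
    """Sum of k * C(n, k) * theta^k * (1 - theta)^(n - k) over k = 0, ..., n."""
    return sum(k * comb(n, k) * theta**k * (1 - theta)**(n - k) for k in range(n + 1))

print(binomial_mean(12, Fraction(1, 3)))  # equals n * theta = 12 * (1/3)
```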


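EXAMPLE 3.1.8 The Geometric(θ) Distribution
If Z ~ Geometric(θ), then P(Z = k) = (1 − θ)^k θ for k = 0, 1, 2, .... Hence,

$$ E(Z) = \sum_{k=0}^{\infty} k\,(1-\theta)^k\,\theta. \tag{3.1.1} $$

Therefore, we can write

$$ (1-\theta)\,E(Z) = \sum_{\ell=0}^{\infty} \ell\,(1-\theta)^{\ell+1}\,\theta. $$

Using the substitution k = ℓ + 1, we compute that

$$ (1-\theta)\,E(Z) = \sum_{k=1}^{\infty} (k-1)\,(1-\theta)^k\,\theta. \tag{3.1.2} $$

Subtracting (3.1.2) from (3.1.1), we see that

$$ \begin{aligned}
\theta\,E(Z) &= E(Z) - (1-\theta)\,E(Z) = \sum_{k=1}^{\infty} \big( k - (k-1) \big) (1-\theta)^k\,\theta \\
&= \sum_{k=1}^{\infty} (1-\theta)^k\,\theta = \frac{1-\theta}{1 - (1-\theta)}\,\theta = 1 - \theta.
\end{aligned} $$

Hence, θE(Z) = 1 − θ, and we obtain E(Z) = (1 − θ)/θ. ∎

EXAMPLE 3.1.9 The Poisson(λ) Distribution
If X ~ Poisson(λ), then P(X = k) = e^{−λ} λ^k / k! for k = 0, 1, 2, .... Hence, setting
ℓ = k − 1,

$$ \begin{aligned}
E(X) &= \sum_{k=0}^{\infty} k\,e^{-\lambda}\,\frac{\lambda^k}{k!}
 = \sum_{k=1}^{\infty} e^{-\lambda}\,\frac{\lambda^k}{(k-1)!}
 = \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} \\
&= \lambda e^{-\lambda} \sum_{\ell=0}^{\infty} \frac{\lambda^{\ell}}{\ell!}
 = \lambda e^{-\lambda} e^{\lambda} = \lambda,
\end{aligned} $$

and we conclude that E(X) = λ. ∎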
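The closed forms (1 − θ)/θ for the geometric mean and λ for the Poisson mean can be compared against truncated versions of the defining series. The sketch below is a numerical spot check only; the function names and the truncation points are arbitrary illustrative choices.

```python
import math

def geometric_mean_partial(theta, terms):
    """Partial sum of k * (1 - theta)^k * theta for k = 0, ..., terms - 1."""
    return sum(k * (1 - theta) ** k * theta for k in range(terms))

def poisson_mean_partial(lam, terms):
    """Partial sum of k * exp(-lam) * lam^k / k! for k = 0, ..., terms - 1."""
    total, prob = 0.0, math.exp(-lam)    # prob holds P(X = k), starting at k = 0
    for k in range(terms):
        total += k * prob
        prob *= lam / (k + 1)            # P(X = k + 1) = P(X = k) * lam / (k + 1)
    return total

print(geometric_mean_partial(0.2, 2000), (1 - 0.2) / 0.2)
print(poisson_mean_partial(3.5, 200), 3.5)
```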


It should be noted that expected values can sometimes be infinite, as the following 
example demonstrates. 


EXAMPLE 3.1.10
Let X be a discrete random variable, with probability function p_X given by

$$ p_X(2^k) = 2^{-k} $$

for k = 1, 2, 3, ..., with p_X(x) = 0 for other values of x. That is, p_X(2) = 1/2,
p_X(4) = 1/4, p_X(8) = 1/8, etc., while p_X(1) = p_X(3) = p_X(5) = p_X(6) = ··· = 0.

Then it is easily checked that p_X is indeed a valid probability function (i.e., p_X(x) ≥
0 for all x, with Σ_x p_X(x) = 1). On the other hand, we compute that

$$ E(X) = \sum_{k=1}^{\infty} (2^k)(2^{-k}) = \sum_{k=1}^{\infty} 1 = \infty. $$

We therefore say that E(X) = ∞, i.e., that the expected value of X is infinite. ∎
Sometimes the expected value simply does not exist, as in the following example.

EXAMPLE 3.1.11
Let Y be a discrete random variable, with probability function p_Y given by

$$ p_Y(y) = \begin{cases} 1/(2|y|) & y = \pm 2, \pm 4, \pm 8, \pm 16, \ldots \\ 0 & \text{otherwise.} \end{cases} $$

That is, p_Y(2) = p_Y(−2) = 1/4, p_Y(4) = p_Y(−4) = 1/8, p_Y(8) = p_Y(−8) =
1/16, etc. Then it is easily checked that p_Y is indeed a valid probability function (i.e.,
p_Y(y) ≥ 0 for all y, with Σ_y p_Y(y) = 1).

On the other hand, we compute that

$$ \begin{aligned}
E(Y) &= \sum_y y\,p_Y(y) = \sum_{k=1}^{\infty} (2^k)(1/(2 \cdot 2^k)) + \sum_{k=1}^{\infty} (-2^k)(1/(2 \cdot 2^k)) \\
&= \sum_{k=1}^{\infty} (1/2) - \sum_{k=1}^{\infty} (1/2) = \infty - \infty,
\end{aligned} $$

which is undefined. We therefore say that E(Y) does not exist, i.e., that the expected
value of Y is undefined in this case. ∎


EXAMPLE 3.1.12 The St. Petersburg Paradox
Suppose someone makes you the following deal. You will repeatedly flip a fair coin
and will receive an award of 2^Z pennies, where Z is the number of tails that appear
before the first head. How much would you be willing to pay for this deal?

Well, the probability that the award will be 2^z pennies is equal to the probability that
you will flip z tails and then one head, which is equal to 1/2^{z+1}. Hence, the expected
value of the award (in pennies) is equal to

$$ \sum_{z=0}^{\infty} (2^z)(1/2^{z+1}) = \sum_{z=0}^{\infty} 1/2 = \infty. $$

In words, the average amount of the award is infinite!

Hence, according to the “fair gambling” interpretation of expected value, as discussed
at the beginning of this chapter, it seems that you should be willing to pay an
infinite amount (or, at least, any finite amount no matter how large) to get the award
promised by this deal! How much do you think you should really be willing to pay for
it?¹ ∎
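The divergence in this example is easy to see from partial sums: each value z = 0, 1, 2, ... contributes exactly 1/2 to the expected award, so the sum over z = 0, ..., N is (N + 1)/2 and grows without bound. The following Python sketch, with a hypothetical function name, computes these partial sums exactly.

```python
from fractions import Fraction

def st_petersburg_partial(max_tails):
    """Sum of 2^z * P(Z = z) for z = 0, ..., max_tails, with P(Z = z) = 1/2^(z+1)."""
    return sum(Fraction(2**z, 2**(z + 1)) for z in range(max_tails + 1))

for n in (0, 9, 99):
    print(n, st_petersburg_partial(n))  # partial sums equal (n + 1) / 2
```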

EXAMPLE 3.1.13 The St. Petersburg Paradox, Truncated
Suppose in the St. Petersburg paradox (Example 3.1.12), it is agreed that the award will
be truncated at 2^30 cents (which is just over $10 million!). That is, the award will be
the same as for the original deal, except the award will be frozen once it exceeds 2^30
cents. Formally, the award is now equal to 2^{min(30,Z)} pennies, where Z is as before.

How much would you be willing to pay for this new award? Well, the expected
value of the new award (in cents) is equal to

$$ \begin{aligned}
\sum_{z=0}^{\infty} (2^{\min(30,z)})(1/2^{z+1})
&= \sum_{z=0}^{30} (2^z)(1/2^{z+1}) + \sum_{z=31}^{\infty} (2^{30})(1/2^{z+1}) \\
&= \sum_{z=0}^{30} (1/2) + (2^{30})(1/2^{31}) = 31/2 + 1/2 = 16.
\end{aligned} $$

That is, truncating the award at just over $10 million changes its expected value
enormously, from infinity to just 16 cents! ∎

In utility theory, it is often assumed that each person has a utility function U such
that, if they win x cents, their amount of “utility” (i.e., benefit or joy or pleasure) is
equal to U(x). In this context, the truncation of Example 3.1.13 may be thought of
not as changing the rules of the game but as corresponding to a utility function of the
form U(x) = min(x, 2^30). In words, this says that your utility is equal to the amount
of money you get, until you reach 2^30 cents (approximately $10 million), after which
point you don’t care about money² anymore. The result of Example 3.1.13 then says
that, with this utility function, the St. Petersburg paradox is only worth 16 cents to
you, even though its expected value is infinite.

¹ When one of the authors first heard about this deal, he decided to try it and agreed to pay $1. In fact, he
got four tails before the first head, so his award was 16 cents, but he still lost 84 cents overall.
² Or, perhaps, you think it is unlikely you will be able to collect the money!

We often need to compute expected values of functions of random variables.
Fortunately, this is not too difficult, as the following theorem shows.

Theorem 3.1.1
(a) Let X be a discrete random variable, and let g : R^1 → R^1 be some function
such that the expectation of the random variable g(X) exists. Then

$$ E(g(X)) = \sum_x g(x)\,P(X = x). $$

(b) Let X and Y be discrete random variables, and let h : R^2 → R^1 be some
function such that the expectation of the random variable h(X, Y) exists. Then

$$ E(h(X, Y)) = \sum_{x,y} h(x, y)\,P(X = x, Y = y). $$

PROOF  We prove part (b) here. Part (a) then follows by simply setting h(x, y) =
g(x) and noting that

$$ \sum_{x,y} g(x)\,P(X = x, Y = y) = \sum_x g(x)\,P(X = x). $$

Let Z = h(X, Y). We have that

$$ \begin{aligned}
E(Z) &= \sum_z z\,P(Z = z) = \sum_z z\,P(h(X, Y) = z) \\
&= \sum_z z \sum_{\substack{x,y \\ h(x,y) = z}} P(X = x, Y = y)
 = \sum_{x,y} \sum_{\substack{z \\ z = h(x,y)}} z\,P(X = x, Y = y) \\
&= \sum_{x,y} h(x, y)\,P(X = x, Y = y),
\end{aligned} $$

as claimed. ∎

One of the most important properties of expected value is that it is linear, stated as
follows.

Theorem 3.1.2 (Linearity of expected values) Let X and Y be discrete random
variables, let a and b be real numbers, and put Z = aX + bY. Then E(Z) =
aE(X) + bE(Y).

PROOF  Let p_{X,Y} be the joint probability function of X and Y. Then using
Theorem 3.1.1,

$$ \begin{aligned}
E(Z) &= \sum_{x,y} (ax + by)\,p_{X,Y}(x, y)
 = a \sum_{x,y} x\,p_{X,Y}(x, y) + b \sum_{x,y} y\,p_{X,Y}(x, y) \\
&= a \sum_x x \sum_y p_{X,Y}(x, y) + b \sum_y y \sum_x p_{X,Y}(x, y).
\end{aligned} $$

Because Σ_y p_{X,Y}(x, y) = p_X(x) and Σ_x p_{X,Y}(x, y) = p_Y(y), we have that

$$ E(Z) = a \sum_x x\,p_X(x) + b \sum_y y\,p_Y(y) = aE(X) + bE(Y), $$

as claimed. ∎

EXAMPLE 3.1.14
Let X ~ Binomial(n, θ_1), and let Y ~ Geometric(θ_2). What is E(3X − 2Y)?

We already know (Examples 3.1.7 and 3.1.8) that E(X) = nθ_1 and E(Y) = (1 −
θ_2)/θ_2. Hence, by Theorem 3.1.2, E(3X − 2Y) = 3E(X) − 2E(Y) = 3nθ_1 − 2(1 −
θ_2)/θ_2. ∎

EXAMPLE 3.1.15
Let Y ~ Binomial(n, θ). Then we know (cf. Example 2.3.3) that we can think of
Y = X_1 + ··· + X_n, where each X_i ~ Bernoulli(θ) (in fact, X_i = 1 if the ith coin is
heads, otherwise X_i = 0). Because E(X_i) = θ for each i, it follows immediately from
Theorem 3.1.2 that

$$ E(Y) = E(X_1) + \cdots + E(X_n) = \theta + \cdots + \theta = n\theta. $$

This gives the same answer as Example 3.1.7, but much more easily. ∎


Suppose that X is a random variable and Y ≡ c is a constant. Then from Theorem
3.1.2, we have that E(X + c) = E(X) + c. From this we see that the mean value μ_X
of X is a measure of the location of the probability distribution of X. For example, if
X takes the value x with probability p and the value y with probability 1 − p, then the
mean of X is μ_X = px + (1 − p)y, which is a value between x and y. For a constant c,
the probability distribution of X + c is concentrated on the points x + c and y + c, with
probabilities p and 1 − p, respectively. The mean of X + c is μ_X + c, which is between
the points x + c and y + c, i.e., the mean shifts with the probability distribution. It is
also true that if X is concentrated on the finite set of points x_1 < x_2 < ··· < x_k, then
x_1 ≤ μ_X ≤ x_k, and the mean shifts exactly as we shift the distribution. This is depicted
in Figure 3.1.1 for a distribution concentrated on k = 4 points. Using the results of
Section 2.6.1, we have that p_{X+c}(x) = p_X(x − c).

Figure 3.1.1: The probability functions and means of discrete random variables X and X + c.


Theorem 3.1.2 says, in particular, that E(X + Y) = E(X) + E(Y), i.e., that ex- 
pectation preserves sums. It is reasonable to ask whether the same property holds for 
products. That is, do we necessarily have E(XY) = E(X)E(Y)? In general, the 
answer is no, as the following example shows. 


EXAMPLE 3.1.16
Let X and Y be discrete random variables, with joint probability function given by

$$ p_{X,Y}(x, y) = \begin{cases}
1/2 & x = 3,\ y = 5 \\
1/6 & x = 3,\ y = 9 \\
1/6 & x = 6,\ y = 5 \\
1/6 & x = 6,\ y = 9 \\
0 & \text{otherwise.}
\end{cases} $$

Then

$$ E(X) = \sum_x x\,P(X = x) = (3)(1/2 + 1/6) + (6)(1/6 + 1/6) = 4 $$

and

$$ E(Y) = \sum_y y\,P(Y = y) = (5)(1/2 + 1/6) + (9)(1/6 + 1/6) = 19/3, $$

while

$$ \begin{aligned}
E(XY) &= \sum_z z\,P(XY = z) \\
&= (3)(5)(1/2) + (3)(9)(1/6) + (6)(5)(1/6) + (6)(9)(1/6) \\
&= 26.
\end{aligned} $$

Because (4)(19/3) ≠ 26, we see that E(X)E(Y) ≠ E(XY) in this case. ∎
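The three expectations in Example 3.1.16 can be reproduced exactly with a few lines of Python. The helper name `expect` and the dictionary representation of the joint probability function are illustrative assumptions.

```python
from fractions import Fraction

# Joint probability function of Example 3.1.16, keyed by (x, y).
joint = {(3, 5): Fraction(1, 2), (3, 9): Fraction(1, 6),
         (6, 5): Fraction(1, 6), (6, 9): Fraction(1, 6)}

def expect(joint_pmf, h):
    """E(h(X, Y)) as the sum of h(x, y) * P(X = x, Y = y)."""
    return sum(h(x, y) * p for (x, y), p in joint_pmf.items())

e_x = expect(joint, lambda x, y: x)
e_y = expect(joint, lambda x, y: y)
e_xy = expect(joint, lambda x, y: x * y)
print(e_x, e_y, e_xy, e_x * e_y)  # prints: 4 19/3 26 76/3
```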
On the other hand, if X and Y are independent, then we do have E(X)E(Y) = 
E(XY). 


Theorem 3.1.3 Let X and Y be discrete random variables that are independent. 
Then E(XY) = E(X)E (Y). 


PROOF Independence implies (see Theorem 2.8.3) that P(X = x, Y = y) =
P(X = x) P(Y = y). Using this, we compute by Theorem 3.1.1 that

\[ E(XY) = \sum_{x,y} xy\,P(X = x, Y = y) = \sum_{x,y} xy\,P(X = x)\,P(Y = y) = \Big(\sum_x x\,P(X = x)\Big)\Big(\sum_y y\,P(Y = y)\Big) = E(X)E(Y), \]

as claimed. ■
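The factorization used in this proof can be illustrated with a small sketch. The two marginal probability functions below are hypothetical, chosen only for illustration; the joint probability function is built as their product, which is exactly the independence assumption.

```python
from fractions import Fraction as F

# Hypothetical marginal probability functions (not from the text).
p_x = {1: F(1, 4), 2: F(3, 4)}
p_y = {0: F(1, 3), 5: F(2, 3)}

# Under independence, P(X = x, Y = y) = P(X = x) P(Y = y).
joint = {(x, y): px * py for x, px in p_x.items() for y, py in p_y.items()}

e_x = sum(x * p for x, p in p_x.items())
e_y = sum(y * p for y, p in p_y.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())

# The double sum factors into the product of the two single sums.
print(e_xy == e_x * e_y)
```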
Theorem 3.1.3 will be used often in subsequent chapters, as will the following impor- 
tant property. 


Theorem 3.1.4 (Monotonicity) Let X and Y be discrete random variables, and
suppose that $X \le Y$. (Remember that this means $X(s) \le Y(s)$ for all $s \in S$.) Then
$E(X) \le E(Y)$.




PROOF Let $Z = Y - X$. Then Z is also discrete. Furthermore, because $X \le Y$,
we have $Z \ge 0$, so that all possible values of Z are nonnegative. Hence, if we list the
possible values of Z as $z_1, z_2, \ldots$, then $z_i \ge 0$ for all i, so that

\[ E(Z) = \sum_i z_i\,P(Z = z_i) \ge 0. \]

But by Theorem 3.1.2, $E(Z) = E(Y) - E(X)$. Hence, $E(Y) - E(X) \ge 0$, so that
$E(Y) \ge E(X)$. ■


Summary of Section 3.1 


• The expected value E(X) of a random variable X represents the long-run average
value that it takes on.

• If X is discrete, then $E(X) = \sum_x x\,P(X = x)$.

• The expected values of the Bernoulli, binomial, geometric, and Poisson distributions
were computed.

• Expected value has an interpretation in terms of fair gambling, but such interpretations
require utility theory to accurately reflect human behavior.

• Expected values of functions of one or two random variables can also be computed
by summing the function values times the probabilities.

• Expectation is linear and monotone.

• If X and Y are independent, then E(XY) = E(X)E(Y). But without independence,
this property may fail.


EXERCISES 


3.1.1 Compute E(X) when the probability function of X is given by each of the following.
(a)
\[ p_X(x) = \begin{cases} 1/7 & x = -4 \\ 2/7 & x = 0 \\ 4/7 & x = 3 \\ 0 & \text{otherwise} \end{cases} \]
(b) 
"E x=0,1,2,... 
PX) =} 0 otherwise 
(c) 
re 2-710 x =7,8,9,... 
px) = 0 otherwise 




3.1.2 Let X and Y have joint probability function given by

\[ p_{X,Y}(x, y) = \begin{cases} 1/7 & x = 5,\ y = 0 \\ 1/7 & x = 5,\ y = 3 \\ 1/7 & x = 5,\ y = 4 \\ 3/7 & x = 8,\ y = 0 \\ 1/7 & x = 8,\ y = 4 \\ 0 & \text{otherwise,} \end{cases} \]

as in Example 2.7.5. Compute each of the following.
(a) E(X)
(b) E(Y)
(c) E(3X + 7Y)
(d) $E(X^2)$
(e) $E(Y^2)$
(f) E(XY)
(g) E(XY + 14)

3.1.3 Let X and Y have joint probability function given by

\[ p_{X,Y}(x, y) = \begin{cases} 1/2 & x = 2,\ y = 10 \\ 1/6 & x = -7,\ y = 10 \\ 1/12 & x = 2,\ y = 12 \\ 1/12 & x = -7,\ y = 12 \\ 1/12 & x = 2,\ y = 14 \\ 1/12 & x = -7,\ y = 14 \\ 0 & \text{otherwise.} \end{cases} \]

Compute each of the following.
(a) E(X)
(b) E(Y)
(c) $E(X^2)$
(d) $E(Y^2)$
(e) $E(X^2 + Y^2)$
(f) $E(XY - 4Y)$

3.1.4 Let $X \sim \text{Bernoulli}(\theta_1)$ and $Y \sim \text{Binomial}(n, \theta_2)$. Compute $E(4X - 3Y)$.

3.1.5 Let $X \sim \text{Geometric}(\theta)$ and $Y \sim \text{Poisson}(\lambda)$. Compute $E(8X - Y + 12)$.

3.1.6 Let Y ~ Binomial(100, 0.3), and Z ~ Poisson(7). Compute E (Y + Z). 

3.1.7 Let X ~ Binomial(80, 1/4), and let Y ~ Poisson(3/2). Assume X and Y are 
independent. Compute E (XY). 

3.1.8 Starting with one penny, suppose you roll one fair six-sided die and get paid an 
additional number of pennies equal to three times the number showing on the die. Let 
X be the total number of pennies you have at the end. Compute E (X). 

3.1.9 Suppose you start with eight pennies and flip one fair coin. If the coin comes up
heads, you get to keep all your pennies; if the coin comes up tails, you have to give
half of them back. Let X be the total number of pennies you have at the end. Compute
E(X).




3.1.10 Suppose you flip two fair coins. Let Y = 3 if the two coins show the same
result, otherwise let Y = 5. Compute E(Y).

3.1.11 Suppose you roll two fair six-sided dice.
(a) Let Z be the sum of the two numbers showing. Compute E(Z).
(b) Let W be the product of the two numbers showing. Compute E(W).

3.1.12 Suppose you flip one fair coin and roll one fair six-sided die. Let X be the
product of the number of heads (i.e., 0 or 1) and the number showing on the die.
Compute E(X). (Hint: Do not forget Theorem 3.1.3.)


3.1.13 Suppose you roll one fair six-sided die and then flip as many coins as the num- 
ber showing on the die. (For example, if the die shows 4, then you flip four coins.) Let 
Y be the number of heads obtained. Compute E (Y). 

3.1.14 Suppose you flip three fair coins, and let X be the cube of the number of heads
showing. Compute E(X).


PROBLEMS 


3.1.15 Suppose you start with one penny and repeatedly flip a fair coin. Each time you 
get heads, before the first time you get tails, you get two more pennies. Let X be the 
total number of pennies you have at the end. Compute E (X). 

3.1.16 Suppose you start with one penny and repeatedly flip a fair coin. Each time you 
get heads, before the first time you get tails, your number of pennies is doubled. Let X 
be the total number of pennies you have at the end. Compute E (X). 

3.1.17 Let $X \sim \text{Geometric}(\theta)$, and let Y = min(X, 100).
(a) Compute E(Y).
(b) Compute $E(Y - X)$.

3.1.18 Give an example of a random variable X such that E(min(X, 100)) = E(X).

3.1.19 Give an example of a random variable X such that E(min(X, 100)) = E(X)/2.

3.1.20 Give an example of a joint probability function $p_{X,Y}$ for random variables X
and Y, such that X ~ Bernoulli(1/4) and Y ~ Bernoulli(1/2), but $E(XY) \ne 1/8$.

3.1.21 For X ~ Hypergeometric(N, M, n), prove that E(X) = nM/N.

3.1.22 For $X \sim \text{Negative-Binomial}(r, \theta)$, prove that $E(X) = r(1 - \theta)/\theta$. (Hint:
Argue that if $X_1, \ldots, X_r$ are independent and identically distributed $\text{Geometric}(\theta)$,
then $X = X_1 + \cdots + X_r \sim \text{Negative-Binomial}(r, \theta)$.)

3.1.23 Suppose that $(X_1, X_2, X_3) \sim \text{Multinomial}(n, \theta_1, \theta_2, \theta_3)$. Prove that
$E(X_i) = n\theta_i$.


CHALLENGES 


3.1.24 Let X ~ Geometric(@). Compute E (X°). 


3.1.25 Suppose X is a discrete random variable, such that E (min(X, M)) = E(X). 
Prove that P(X > M) = 0. 




DISCUSSION TOPICS 


3.1.26 How much would you be willing to pay for the deal corresponding to the 
St. Petersburg paradox (see Example 3.1.12)? Justify your answer. 


3.1.27 What utility function U (as in the text following Example 3.1.13) best describes 
your own personal attitude toward money? Why? 


3.2 The Absolutely Continuous Case


Suppose now that X is absolutely continuous, with density function $f_X$. How can
we compute E(X) then? By analogy with the discrete case, we might try computing
$\sum_x x\,P(X = x)$, but because P(X = x) is always zero, this sum is always zero as
well.

On the other hand, if $\epsilon$ is a small positive number, then we could try approximating
E(X) by

\[ E(X) \approx \sum_i i\epsilon\,P(i\epsilon \le X < (i + 1)\epsilon), \]

where the sum is over all integers i. This makes sense because, if $\epsilon$ is small and
$i\epsilon \le X < (i + 1)\epsilon$, then $X \approx i\epsilon$.
Now, we know that

\[ P(i\epsilon \le X < (i + 1)\epsilon) = \int_{i\epsilon}^{(i+1)\epsilon} f_X(x)\,dx. \]

This tells us that

\[ E(X) \approx \sum_i \int_{i\epsilon}^{(i+1)\epsilon} i\epsilon\,f_X(x)\,dx. \]

Furthermore, in this integral, $i\epsilon \le x < (i + 1)\epsilon$. Hence, $i\epsilon \approx x$. We therefore see that

\[ E(X) \approx \sum_i \int_{i\epsilon}^{(i+1)\epsilon} x\,f_X(x)\,dx = \int_{-\infty}^{\infty} x\,f_X(x)\,dx. \]


This prompts the following definition. 


Definition 3.2.1 Let X be an absolutely continuous random variable, with density
function $f_X$. Then the expected value of X is given by

\[ E(X) = \int_{-\infty}^{\infty} x\,f_X(x)\,dx. \]
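The approximation that motivates Definition 3.2.1 can be tried numerically. The sketch below sums iε · P(iε ≤ X < (i + 1)ε) over a grid and shows the sum settling toward the integral as ε shrinks. The function name `approx_mean`, the truncation to a finite range, and the choice of an Exponential(2) density (whose mean is 1/2) are all assumptions made for illustration. Each interval probability is itself approximated by the density at the midpoint times ε.

```python
import math

def approx_mean(density, eps, lo, hi):
    """Approximate E(X) by the sum of (i*eps) * P(i*eps <= X < (i+1)*eps) over [lo, hi).

    Each interval probability is approximated by density(midpoint) * eps.
    """
    total = 0.0
    i = math.floor(lo / eps)
    while i * eps < hi:
        left = i * eps
        prob = density(left + eps / 2) * eps  # approx. P(left <= X < left + eps)
        total += left * prob
        i += 1
    return total

# Illustrative density: Exponential(2), which has mean 1/2.
lam = 2.0
density = lambda x: lam * math.exp(-lam * x) if x >= 0 else 0.0

for eps in (0.5, 0.1, 0.01):
    print(eps, approx_mean(density, eps, 0.0, 40.0))
```

The sum uses the left endpoint iε in place of x, so it slightly underestimates the mean; the gap shrinks as ε does.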


From this definition, it is not too difficult to compute the expected values of many 
of the standard absolutely continuous distributions. 


EXAMPLE 3.2.1 The Uniform[0, 1] Distribution 
Let X ~ Uniform[0, 1] so that the density of X is given by

\[ f_X(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases} \]

Hence,
\[ E(X) = \int_{-\infty}^{\infty} x\,f_X(x)\,dx = \int_0^1 x\,dx = \frac{x^2}{2}\Big|_{x=0}^{x=1} = 1/2, \]
as one would expect. ■


EXAMPLE 3.2.2 The Uniform[L, R] Distribution 
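Let X ~ Uniform[L, R] so that the density of X is given by

\[ f_X(x) = \begin{cases} 1/(R - L) & L \le x \le R \\ 0 & \text{otherwise.} \end{cases} \]

Hence,
\[ E(X) = \int_{-\infty}^{\infty} x\,f_X(x)\,dx = \int_L^R \frac{x}{R - L}\,dx = \frac{x^2}{2(R - L)}\Big|_{x=L}^{x=R} = \frac{R^2 - L^2}{2(R - L)} = \frac{(R - L)(R + L)}{2(R - L)} = \frac{R + L}{2}, \]
again as one would expect. ■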


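EXAMPLE 3.2.3 The Exponential($\lambda$) Distribution
Let $Y \sim \text{Exponential}(\lambda)$ so that the density of Y is given by

\[ f_Y(y) = \begin{cases} \lambda e^{-\lambda y} & y \ge 0 \\ 0 & y < 0. \end{cases} \]

Hence, integration by parts, with u = y and $dv = \lambda e^{-\lambda y}\,dy$ (so du = dy, $v = -e^{-\lambda y}$),
leads to

\[ \begin{aligned} E(Y) &= \int_{-\infty}^{\infty} y\,f_Y(y)\,dy = \int_0^{\infty} y\,\lambda e^{-\lambda y}\,dy = -y e^{-\lambda y}\Big|_0^{\infty} + \int_0^{\infty} e^{-\lambda y}\,dy \\ &= \frac{e^{-\lambda y}}{-\lambda}\Big|_0^{\infty} = -\frac{0 - 1}{\lambda} = \frac{1}{\lambda}. \end{aligned} \]

In particular, if $\lambda = 1$, then Y ~ Exponential(1) and E(Y) = 1. ■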


EXAMPLE 3.2.4 The N (0, 1) Distribution 
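Let Z ~ N(0, 1) so that the density of Z is given by

\[ f_Z(z) = \phi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}. \]

Hence,

\[ \begin{aligned} E(Z) &= \int_{-\infty}^{\infty} z\,f_Z(z)\,dz = \int_{-\infty}^{\infty} z\,\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz \\ &= \int_{-\infty}^{0} z\,\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz + \int_{0}^{\infty} z\,\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz. \end{aligned} \qquad (3.2.1) \]

But using the substitution $w = -z$, we see that

\[ \int_{-\infty}^{0} z\,\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz = \int_{0}^{\infty} (-w)\,\frac{1}{\sqrt{2\pi}}\,e^{-w^2/2}\,dw. \]

Then the two integrals in (3.2.1) cancel each other out, leaving us with E(Z) = 0. ■

As with discrete variables, means of absolutely continuous random variables can
also be infinite or undefined.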


As with discrete variables, means of absolutely continuous random variables can 
also be infinite or undefined. 


EXAMPLE 3.2.5 
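Let X have density function given by

\[ f_X(x) = \begin{cases} 1/x^2 & x \ge 1 \\ 0 & \text{otherwise.} \end{cases} \]

Then
\[ E(X) = \int_{-\infty}^{\infty} x\,f_X(x)\,dx = \int_1^{\infty} x\,(1/x^2)\,dx = \int_1^{\infty} (1/x)\,dx = \log x\,\Big|_{x=1}^{x=\infty} = \infty. \]

Hence, the expected value of X is infinite. ■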
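One way to see the divergence in Example 3.2.5 is to truncate the integral at a finite upper limit and watch the value grow without bound. The sketch below does this with a plain midpoint rule; the function name `truncated_mean` and the step count are illustrative assumptions. The truncated value should track log(upper), which increases without limit.

```python
import math

def truncated_mean(upper, steps=100_000):
    """Midpoint-rule value of the integral of x * f_X(x) over [1, upper], with f_X(x) = 1/x**2."""
    width = (upper - 1.0) / steps
    total = 0.0
    for k in range(steps):
        x = 1.0 + (k + 0.5) * width
        total += x * (1.0 / x**2) * width
    return total

# Compare each truncated integral with log(upper).
for upper in (10.0, 100.0, 1000.0):
    print(upper, truncated_mean(upper), math.log(upper))
```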


EXAMPLE 3.2.6 
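Let Y have density function given by

\[ f_Y(y) = \begin{cases} 1/(2y^2) & y \ge 1 \\ 1/(2y^2) & y \le -1 \\ 0 & \text{otherwise.} \end{cases} \]

Then

\[ \begin{aligned} E(Y) &= \int_{-\infty}^{\infty} y\,f_Y(y)\,dy = \int_{-\infty}^{-1} y\,\frac{1}{2y^2}\,dy + \int_{1}^{\infty} y\,\frac{1}{2y^2}\,dy \\ &= -\int_{1}^{\infty} \frac{1}{2y}\,dy + \int_{1}^{\infty} \frac{1}{2y}\,dy = -\infty + \infty, \end{aligned} \]

which is undefined. Hence, the expected value of Y is undefined (i.e., does not exist)
in this case. ■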


Theorem 3.1.1 remains true in the continuous case, as follows. 




Theorem 3.2.1
(a) Let X be an absolutely continuous random variable, with density function $f_X$,
and let $g : R^1 \to R^1$ be some function. Then when the expectation of g(X) exists,

\[ E(g(X)) = \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx. \]

(b) Let X and Y be jointly absolutely continuous random variables, with joint density
function $f_{X,Y}$, and let $h : R^2 \to R^1$ be some function. Then when the expectation
of h(X, Y) exists,

\[ E(h(X, Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y)\,f_{X,Y}(x, y)\,dx\,dy. \]

We do not prove Theorem 3.2.1 here; however, we shall use it often. For a first use
of this result, we prove that expected values for absolutely continuous random variables
are still linear.


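Theorem 3.2.2 (Linearity of expected values) Let X and Y be jointly absolutely
continuous random variables, and let a and b be real numbers. Then E(aX + bY) =
aE(X) + bE(Y).

PROOF Let Z = aX + bY, and let $f_{X,Y}$ be the joint density function of X and Y. Then
using Theorem 3.2.1, we compute that

\[ \begin{aligned} E(Z) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (ax + by)\,f_{X,Y}(x, y)\,dx\,dy \\ &= a \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x\,f_{X,Y}(x, y)\,dx\,dy + b \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} y\,f_{X,Y}(x, y)\,dx\,dy \\ &= a \int_{-\infty}^{\infty} x \Big( \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy \Big) dx + b \int_{-\infty}^{\infty} y \Big( \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx \Big) dy. \end{aligned} \]

But $\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = f_X(x)$ and $\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx = f_Y(y)$, so

\[ E(Z) = a \int_{-\infty}^{\infty} x\,f_X(x)\,dx + b \int_{-\infty}^{\infty} y\,f_Y(y)\,dy = aE(X) + bE(Y), \]

as claimed. ■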


Just as in the discrete case, we have that E(X + c) = E(X) +c for an absolutely 
continuous random variable X. Note, however, that this is not implied by Theorem 
3.2.2 because the constant c is a discrete, not absolutely continuous, random variable. 
In fact, we need a more general treatment of expectation to obtain this result (see Sec- 
tion 3.7). In any case, the result is true and we again have that the mean of a random 


variable serves as a measure of the location of the probability distribution of X. In
Figure 3.2.1, we have plotted the densities and means of the absolutely continuous
random variables X and X + c. The change of variable results from Section 2.6.2 give

\[ f_{X+c}(x) = f_X(x - c). \]

Figure 3.2.1: The densities and means of absolutely continuous random variables X and
X + c.


EXAMPLE 3.2.7 The $N(\mu, \sigma^2)$ Distribution
Let $X \sim N(\mu, \sigma^2)$. Then we know (cf. Exercise 2.6.3) that if $Z = (X - \mu)/\sigma$, then
$Z \sim N(0, 1)$. Hence, we can write $X = \mu + \sigma Z$, where $Z \sim N(0, 1)$. But we know
(see Example 3.2.4) that E(Z) = 0 and (see Example 3.1.5) that $E(\mu) = \mu$. Hence,
using Theorem 3.2.2, $E(X) = E(\mu + \sigma Z) = E(\mu) + \sigma E(Z) = \mu + \sigma(0) = \mu$. ■

If X and Y are independent, then the following result shows that we again have
E(XY) = E(X)E(Y).


Theorem 3.2.3 Let X and Y be jointly absolutely continuous random variables that 
are independent. Then E (XY) = E(X)E(Y). 


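PROOF Independence implies (Theorem 2.8.3) that $f_{X,Y}(x, y) = f_X(x)\,f_Y(y)$.
Using this, along with Theorem 3.2.1, we compute

\[ \begin{aligned} E(XY) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\,f_{X,Y}(x, y)\,dx\,dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\,f_X(x)\,f_Y(y)\,dx\,dy \\ &= \Big( \int_{-\infty}^{\infty} x\,f_X(x)\,dx \Big) \Big( \int_{-\infty}^{\infty} y\,f_Y(y)\,dy \Big) = E(X)E(Y), \end{aligned} \]

as claimed. ■

The monotonicity property (Theorem 3.1.4) still holds as well.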




Theorem 3.2.4 (Monotonicity) Let X and Y be jointly continuous random variables,
and suppose that $X \le Y$. Then $E(X) \le E(Y)$.

PROOF Let $f_{X,Y}$ be the joint density function of X and Y. Because $X \le Y$, the
density $f_{X,Y}$ can be chosen so that $f_{X,Y}(x, y) = 0$ whenever x > y. Now let
$Z = Y - X$. Then by Theorem 3.2.1(b),

\[ E(Z) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (y - x)\,f_{X,Y}(x, y)\,dx\,dy. \]

Because $f_{X,Y}(x, y) = 0$ whenever x > y, this implies that $E(Z) \ge 0$. But by Theorem
3.2.2, $E(Z) = E(Y) - E(X)$. Hence, $E(Y) - E(X) \ge 0$, so that $E(Y) \ge E(X)$. ■


Summary of Section 3.2 


• If X is absolutely continuous, then $E(X) = \int x\,f_X(x)\,dx$.

• The expected values of the uniform, exponential, and normal distributions were
computed.

• Expectation for absolutely continuous random variables is linear and monotone.

• If X and Y are independent, then we still have E(XY) = E(X)E(Y).


EXERCISES 


3.2.1 Compute C and E(X) when the density function of X is given by each of the 


following. 
k C 5 9 
<x < 
Ix) = | 0 otherwise 
(b) 
_ | C@+)) 6<x <8 
1a) = | 0 otherwise 
(c) i 
Cx —5 <x < —2 
Ix) = | 0 otherwise 
3.2.2 Let X and Y have joint density

\[ f_{X,Y}(x, y) = \begin{cases} 4x^2 y + 2y^5 & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0 & \text{otherwise,} \end{cases} \]

as in Examples 2.7.6 and 2.7.7. Compute each of the following.
(a) E(X)
(b) E(Y)
(c) E(3X + 7Y)
(d) $E(X^2)$
(e) $E(Y^2)$
(f) E(XY)
(g) E(XY + 14)

3.2.3 Let X and Y have joint density 


_ | Gxy +3x7y)/18 O<x<10<y<3 
Ixy@,y) = | 0 otherwise. 


Compute each of the following. 

(a) E(X) 

(b) E(Y) 

© E(X*) 

(d) EY?) 

(e) E(Y*) 

(f) EY?) 

3.2.4 Let X and Y have joint density 


_ | oy + @/2)x*y O0<y<x<l 
FORO = | 0 otherwise. 


Compute each of the following. 

(a) E(X) 

b) EY) 

(c) E(X?) 

(d) EY?) 

© EYS 

(f) ERP’) 

3.2.5 Let X ~ Uniform[3, 7] and Y ~ Exponential(9). Compute E (—5X — 6Y). 
3.2.6 Let X ~ Uniform[—12, —9] and Y ~ N(—8, 9). Compute E (11X + 14Y +3). 
3.2.7 Let Y ~ Exponential(9) and Z ~ Exponential (8). Compute E (Y + Z). 

3.2.8 Let Y ~ Exponential(9) and Z ~ Gamma(5, 4). Compute E (Y + Z). (You 
may use Problem 3.2.16 below.) 

3.2.9 Suppose X has density function $f(x) = (3/20)(x^2 + x^3)$ for 0 < x < 2, otherwise
f(x) = 0. Compute each of E(X), $E(X^2)$, and $E(X^3)$, and rank them from largest to
smallest.

3.2.10 Suppose X has density function $f(x) = (12/7)(x^2 + x^3)$ for 0 < x < 1, otherwise
f(x) = 0. Compute each of E(X), $E(X^2)$, and $E(X^3)$, and rank them from
largest to smallest.

3.2.11 Suppose men's heights (in centimeters) follow the distribution $N(174, 20^2)$,
while those of women follow the distribution $N(160, 15^2)$. Compute the mean total
height of a man-woman married couple.

3.2.12 Suppose X and Y are independent, with E(X) = 5 and E(Y) = 6. For each of
the following variables Z, either compute E(Z) or explain why we cannot determine
E(Z) from the available information:
(a) Z = X + Y
(b) Z = XY
(c) Z = 2X - 4Y
(d) Z = 2X(3 + 4Y)
(e) Z = (2 + X)(3 + 4Y)
(f) Z = (2 + X)(3X + 4Y)

3.2.13 Suppose darts are randomly thrown at a wall. Let X be the distance (in cen- 
timeters) from the left edge of the dart’s point to the left end of the wall, and let Y be 
the distance from the right edge of the dart’s point to the left end of the wall. Assume 
the dart’s point is 0.1 centimeters thick, and that E(X) = 214. Compute E (Y). 

3.2.14 Let X be the mean height of all citizens measured from the top of their head, 
and let Y be the mean height of all citizens measured from the top of their head or hat 
(whichever is higher). Must we have E(Y) > E(X)? Why or why not? 

3.2.15 Suppose basketball teams A and B each have five players and that each member 
of team A is being “guarded” by a unique member of team B. Suppose it is noticed that 
each member of team A is taller than the corresponding guard from team B. Does it 
necessarily follow that the mean height of team A is larger than the mean height of 
team B? Why or why not? 


PROBLEMS 


3.2.16 Let $\alpha > 0$ and $\lambda > 0$, and let $X \sim \text{Gamma}(\alpha, \lambda)$. Prove that $E(X) = \alpha/\lambda$.
(Hint: The computations are somewhat similar to those of Problem 2.4.15. You will
also need property (2.4.7) of the gamma function.)

3.2.17 Suppose that X follows the logistic distribution (see Problem 2.4.18). Prove
that E(X) = 0.

3.2.18 Suppose that X follows the Weibull($\alpha$) distribution (see Problem 2.4.19). Prove
that $E(X) = \Gamma(\alpha^{-1} + 1)$.

3.2.19 Suppose that X follows the Pareto($\alpha$) distribution (see Problem 2.4.20) for $\alpha > 1$.
Prove that $E(X) = 1/(\alpha - 1)$. What is E(X) when $0 < \alpha \le 1$?

3.2.20 Suppose that X follows the Cauchy distribution (see Problem 2.4.21). Argue
that E(X) does not exist. (Hint: Compute the integral in two parts, where the integrand
is positive and where the integrand is negative.)

3.2.21 Suppose that X follows the Laplace distribution (see Problem 2.4.22). Prove
that E(X) = 0.

3.2.22 Suppose that X follows the Beta(a, b) distribution (see Problem 2.4.24). Prove
that E(X) = a/(a + b).

3.2.23 Suppose that $(X_1, X_2) \sim \text{Dirichlet}(\alpha_1, \alpha_2, \alpha_3)$ (see Problem 2.7.17). Prove
that $E(X_i) = \alpha_i/(\alpha_1 + \alpha_2 + \alpha_3)$.




3.3 Variance, Covariance, and Correlation


Now that we understand expected value, we can use it to define various other quantities 
of interest. The numerical values of these quantities provide information about the 
distribution of random variables. 

Given a random variable X, we know that the average value of X will be E(X). 
However, this tells us nothing about how far X tends to be from E(X). For that, we 
have the following definition. 


Definition 3.3.1 The variance of a random variable X is the quantity

\[ \sigma_X^2 = \text{Var}(X) = E\big((X - \mu_X)^2\big), \qquad (3.3.1) \]

where $\mu_X = E(X)$ is the mean of X.

We note that it is also possible to write (3.3.1) as $\text{Var}(X) = E\big((X - E(X))^2\big)$; however,
the multiple uses of "E" may be confusing. Also, because $(X - \mu_X)^2$ is always
nonnegative, its expectation is always defined, so the variance of X is always defined.
Intuitively, the variance Var(X) is a measure of how spread out the distribution of
X is, or how random X is, or how much X varies, as the following example illustrates.


EXAMPLE 3.3.1 
Let X and Y be two discrete random variables, with probability functions

\[ p_X(x) = \begin{cases} 1 & x = 10 \\ 0 & \text{otherwise} \end{cases} \]

and

\[ p_Y(y) = \begin{cases} 1/2 & y = 5 \\ 1/2 & y = 15 \\ 0 & \text{otherwise,} \end{cases} \]

respectively.

Then E(X) = E(Y) = 10. However,
\[ \text{Var}(X) = (10 - 10)^2 (1) = 0, \]
while
\[ \text{Var}(Y) = (5 - 10)^2 (1/2) + (15 - 10)^2 (1/2) = 25. \]
We thus see that, while X and Y have the same expected value, the variance of Y is
much greater than that of X. This corresponds to the fact that Y is more random than
X; that is, it varies more than X does. ■


EXAMPLE 3.3.2 
Let X have probability function given by

\[ p_X(x) = \begin{cases} 1/2 & x = 2 \\ 1/6 & x = 3 \\ 1/6 & x = 4 \\ 1/6 & x = 5 \\ 0 & \text{otherwise.} \end{cases} \]

Then E(X) = (2)(1/2) + (3)(1/6) + (4)(1/6) + (5)(1/6) = 3. Hence,
\[ \text{Var}(X) = (2 - 3)^2\,\tfrac{1}{2} + (3 - 3)^2\,\tfrac{1}{6} + (4 - 3)^2\,\tfrac{1}{6} + (5 - 3)^2\,\tfrac{1}{6} = 4/3. \] ■
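The variances in Examples 3.3.1 and 3.3.2 follow directly from Definition 3.3.1, and a short sketch can reproduce them. The helper names `mean` and `variance`, and the dictionary names, are illustrative and not part of the text.

```python
from fractions import Fraction as F

def mean(pf):
    """E(X) for a probability function given as {value: probability}."""
    return sum(x * p for x, p in pf.items())

def variance(pf):
    """Var(X) = E((X - mu_X)^2), computed from the definition."""
    mu = mean(pf)
    return sum((x - mu) ** 2 * p for x, p in pf.items())

p_331_x = {10: F(1)}                                       # X of Example 3.3.1
p_331_y = {5: F(1, 2), 15: F(1, 2)}                        # Y of Example 3.3.1
p_332 = {2: F(1, 2), 3: F(1, 6), 4: F(1, 6), 5: F(1, 6)}   # X of Example 3.3.2

print(variance(p_331_x), variance(p_331_y), variance(p_332))
```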


EXAMPLE 3.3.3 
Let $Y \sim \text{Bernoulli}(\theta)$. Then $E(Y) = \theta$. Hence,

\[ \begin{aligned} \text{Var}(Y) &= E((Y - \theta)^2) = (1 - \theta)^2\,\theta + (0 - \theta)^2 (1 - \theta) \\ &= \theta - 2\theta^2 + \theta^3 + \theta^2 - \theta^3 = \theta - \theta^2 = \theta(1 - \theta). \end{aligned} \] ■

The square in (3.3.1) implies that the "scale" of Var(X) is different from the scale
of X. For example, if X were measuring a distance in meters (m), then Var(X) would
be measuring in meters squared ($m^2$). If we then switched from meters to feet, we
would have to multiply X by about 3.28084 but would have to multiply Var(X) by
about $(3.28084)^2$.

To correct for this “scale” problem, we can simply take the square root, as follows. 


Definition 3.3.2 The standard deviation of a random variable X is the quantity

\[ \sigma_X = \text{Sd}(X) = \sqrt{\text{Var}(X)} = \sqrt{E((X - \mu_X)^2)}. \]

It is reasonable to ask why, in (3.3.1), we need the square at all. Now, if we simply
omitted the square and considered $E((X - \mu_X))$, we would always get zero (because
$\mu_X = E(X)$), which is useless. On the other hand, we could instead use $E(|X - \mu_X|)$.
This would, like (3.3.1), be a valid measure of the average distance of X from $\mu_X$.
Furthermore, it would not have the "scale problem" that Var(X) does. However, we
shall see that Var(X) has many convenient properties. By contrast, $E(|X - \mu_X|)$ is
very difficult to work with. Thus, it is purely for convenience that we define variance
by $E((X - \mu_X)^2)$ instead of $E(|X - \mu_X|)$.

Variance will be very important throughout the remainder of this book. Thus, we 
pause to present some important properties of Var. 


Theorem 3.3.1 Let X be any random variable, with expected value $\mu_X = E(X)$,
and variance Var(X). Then the following hold true:

(a) $\text{Var}(X) \ge 0$.

(b) If a and b are real numbers, $\text{Var}(aX + b) = a^2\,\text{Var}(X)$.

(c) $\text{Var}(X) = E(X^2) - (\mu_X)^2 = E(X^2) - E(X)^2$. (That is, variance is equal to the
second moment minus the square of the first moment.)

(d) $\text{Var}(X) \le E(X^2)$.


PROOF (a) This is immediate, because we always have $(X - \mu_X)^2 \ge 0$.
(b) We note that $\mu_{aX+b} = E(aX + b) = aE(X) + b = a\mu_X + b$, by linearity. Hence,
again using linearity,

\[ \begin{aligned} \text{Var}(aX + b) &= E\big((aX + b - \mu_{aX+b})^2\big) = E\big((aX + b - a\mu_X - b)^2\big) \\ &= a^2 E\big((X - \mu_X)^2\big) = a^2\,\text{Var}(X). \end{aligned} \]

(c) Again, using linearity,

\[ \begin{aligned} \text{Var}(X) &= E\big((X - \mu_X)^2\big) = E\big(X^2 - 2X\mu_X + (\mu_X)^2\big) \\ &= E(X^2) - 2E(X)\mu_X + (\mu_X)^2 = E(X^2) - 2(\mu_X)^2 + (\mu_X)^2 \\ &= E(X^2) - (\mu_X)^2. \end{aligned} \]

(d) This follows immediately from part (c) because we have $-(\mu_X)^2 \le 0$. ■
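The four parts of Theorem 3.3.1 can be spot-checked on a concrete distribution. The sketch below reuses the probability function of Example 3.3.2 and exact fractions; the constants `a` and `b` are arbitrary, and the helper names `expect` and `variance_of` are illustrative. A check on one distribution is of course not a proof, only a sanity test of the identities.

```python
from fractions import Fraction as F

# Probability function of Example 3.3.2.
pf = {2: F(1, 2), 3: F(1, 6), 4: F(1, 6), 5: F(1, 6)}

def expect(g, pf):
    """E(g(X)) as the sum of g(x) * P(X = x)."""
    return sum(g(x) * p for x, p in pf.items())

def variance_of(g, pf):
    """Var(g(X)) from the definition E((g(X) - E(g(X)))^2)."""
    mu = expect(g, pf)
    return expect(lambda x: (g(x) - mu) ** 2, pf)

a, b = -5, 3  # arbitrary illustrative constants
var_x = variance_of(lambda x: x, pf)
second_moment = expect(lambda x: x * x, pf)

print(var_x >= 0)                                                 # part (a)
print(variance_of(lambda x: a * x + b, pf) == a**2 * var_x)       # part (b)
print(var_x == second_moment - expect(lambda x: x, pf) ** 2)      # part (c)
print(var_x <= second_moment)                                     # part (d)
```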


Theorem 3.3.1 often provides easier ways of computing variance, as in the follow- 
ing examples. 


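EXAMPLE 3.3.4 Variance of the Exponential($\lambda$) Distribution
Let $W \sim \text{Exponential}(\lambda)$, so that $f_W(w) = \lambda e^{-\lambda w}$ for $w \ge 0$. Then $E(W) = 1/\lambda$. Also, using
integration by parts,

\[ \begin{aligned} E(W^2) &= \int_0^{\infty} w^2\,\lambda e^{-\lambda w}\,dw = \int_0^{\infty} 2w\,e^{-\lambda w}\,dw \\ &= (2/\lambda) \int_0^{\infty} w\,\lambda e^{-\lambda w}\,dw = (2/\lambda)E(W) = 2/\lambda^2. \end{aligned} \]

Hence, by part (c) of Theorem 3.3.1,
\[ \text{Var}(W) = E(W^2) - (E(W))^2 = (2/\lambda^2) - (1/\lambda)^2 = 1/\lambda^2. \] ■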


EXAMPLE 3.3.5 
Let $W \sim \text{Exponential}(\lambda)$, and let Y = 5W + 3. Then from the above example,
$\text{Var}(W) = 1/\lambda^2$. Then, using part (b) of Theorem 3.3.1,

\[ \text{Var}(Y) = \text{Var}(5W + 3) = 25\,\text{Var}(W) = 25/\lambda^2. \] ■

Because $\sqrt{a^2} = |a|$, part (b) of Theorem 3.3.1 immediately implies a corresponding
fact about standard deviation.

Corollary 3.3.1 Let X be any random variable, with standard deviation Sd(X), and
let a be any real number. Then Sd(aX) = |a| Sd(X).


EXAMPLE 3.3.6 
Let $W \sim \text{Exponential}(\lambda)$, and let Y = 5W + 3. Then using the above examples, we
see that $\text{Sd}(W) = (\text{Var}(W))^{1/2} = (1/\lambda^2)^{1/2} = 1/\lambda$. Also, $\text{Sd}(Y) = (\text{Var}(Y))^{1/2} =
(25/\lambda^2)^{1/2} = 5/\lambda$. This agrees with Corollary 3.3.1, since Sd(Y) = |5| Sd(W). ■

EXAMPLE 3.3.7 Variance and Standard Deviation of the $N(\mu, \sigma^2)$ Distribution
Suppose that $X \sim N(\mu, \sigma^2)$. In Example 3.2.7 we established that $E(X) = \mu$. Now
we compute Var(X).
First consider $Z \sim N(0, 1)$. Then from Theorem 3.3.1(c) we have that

\[ \text{Var}(Z) = E(Z^2) = \int_{-\infty}^{\infty} z^2\,\frac{1}{\sqrt{2\pi}} \exp\{-z^2/2\}\,dz. \]

Then, putting u = z, $dv = z \exp\{-z^2/2\}\,dz$ (so du = dz, $v = -\exp\{-z^2/2\}$), and using
integration by parts, we obtain

\[ \text{Var}(Z) = \frac{1}{\sqrt{2\pi}} \left\{ -z \exp\{-z^2/2\}\,\Big|_{-\infty}^{\infty} + \int_{-\infty}^{\infty} \exp\{-z^2/2\}\,dz \right\} = 1, \]

and Sd(Z) = 1.
Now, for $\sigma > 0$, put $X = \mu + \sigma Z$. We then have $X \sim N(\mu, \sigma^2)$. From Theorem
3.3.1(b) we have that

\[ \text{Var}(X) = \text{Var}(\mu + \sigma Z) = \sigma^2\,\text{Var}(Z) = \sigma^2 \]

and $\text{Sd}(X) = \sigma$. This establishes the variance of the $N(\mu, \sigma^2)$ distribution as $\sigma^2$ and
the standard deviation as $\sigma$.

In Figure 3.3.1, we have plotted three normal distributions, all with mean 0 but
different variances.

Figure 3.3.1: Plots of the N(0, 1) (solid line), the N(0, 1/4) (dashed line) and the
N(0, 4) (dotted line) density functions.

The effect of the variance on the amount of spread of the distribution about the mean
is quite clear from these plots. As $\sigma^2$ increases, the distribution becomes more diffuse;
as it decreases, it becomes more concentrated about the mean 0. ■


So far we have considered the variance of one random variable at a time. How- 
ever, the related concept of covariance measures the relationship between two random 
variables. 


Definition 3.3.3 The covariance of two random variables X and Y is given by

\[ \text{Cov}(X, Y) = E\big((X - \mu_X)(Y - \mu_Y)\big), \]

where $\mu_X = E(X)$ and $\mu_Y = E(Y)$.


EXAMPLE 3.3.8 
Let X and Y be discrete random variables, with joint probability function $p_{X,Y}$ given
by

\[ p_{X,Y}(x, y) = \begin{cases} 1/2 & x = 3,\ y = 4 \\ 1/3 & x = 3,\ y = 6 \\ 1/6 & x = 5,\ y = 6 \\ 0 & \text{otherwise.} \end{cases} \]

Then E(X) = (3)(1/2) + (3)(1/3) + (5)(1/6) = 10/3, and E(Y) = (4)(1/2) +
(6)(1/3) + (6)(1/6) = 5. Hence,

\[ \begin{aligned} \text{Cov}(X, Y) &= E((X - 10/3)(Y - 5)) \\ &= (3 - 10/3)(4 - 5)/2 + (3 - 10/3)(6 - 5)/3 + (5 - 10/3)(6 - 5)/6 \\ &= 1/3. \end{aligned} \] ■

EXAMPLE 3.3.9
Let X be any random variable with Var(X) > 0. Let Y = 3X, and let $Z = -4X$. Then
$\mu_Y = 3\mu_X$ and $\mu_Z = -4\mu_X$. Hence,

\[ \begin{aligned} \text{Cov}(X, Y) &= E((X - \mu_X)(Y - \mu_Y)) = E((X - \mu_X)(3X - 3\mu_X)) \\ &= 3\,E((X - \mu_X)^2) = 3\,\text{Var}(X), \end{aligned} \]

while

\[ \begin{aligned} \text{Cov}(X, Z) &= E((X - \mu_X)(Z - \mu_Z)) = E((X - \mu_X)((-4)X - (-4)\mu_X)) \\ &= (-4)\,E((X - \mu_X)^2) = -4\,\text{Var}(X). \end{aligned} \]

Note in particular that Cov(X, Y) > 0, while Cov(X, Z) < 0. Intuitively, this says that
Y increases when X increases, whereas Z decreases when X increases. ■
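Both covariance examples can be reproduced from Definition 3.3.3. In the sketch below, random variables are represented as functions of the pair (x, y), and the joint probability function of Example 3.3.8 is reused to illustrate the scaling in Example 3.3.9. The names `expect`, `cov`, `X`, and `Y` are illustrative only.

```python
from fractions import Fraction as F

# Joint probability function of Example 3.3.8.
joint = {(3, 4): F(1, 2), (3, 6): F(1, 3), (5, 6): F(1, 6)}

def expect(h, pf):
    """E(h(X, Y)) as the sum of h(x, y) * P(X = x, Y = y)."""
    return sum(h(x, y) * p for (x, y), p in pf.items())

def cov(g, h, pf):
    """Cov(g(X, Y), h(X, Y)) from the definition E((G - mu_G)(H - mu_H))."""
    mg, mh = expect(g, pf), expect(h, pf)
    return expect(lambda x, y: (g(x, y) - mg) * (h(x, y) - mh), pf)

X = lambda x, y: x
Y = lambda x, y: y
var_x = cov(X, X, joint)  # Cov(X, X) is Var(X)

print(cov(X, Y, joint))                                    # Example 3.3.8
print(cov(X, lambda x, y: 3 * x, joint) == 3 * var_x)      # Cov(X, 3X) = 3 Var(X)
print(cov(X, lambda x, y: -4 * x, joint) == -4 * var_x)    # Cov(X, -4X) = -4 Var(X)
```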


We begin with some simple facts about covariance. Obviously, we always have 
Cov(X, Y) =Cov(Y, X). We also have the following result. 


Theorem 3.3.2 (Linearity of covariance) Let X, Y, and Z be three random variables.
Let a and b be real numbers. Then

\[ \text{Cov}(aX + bY, Z) = a\,\text{Cov}(X, Z) + b\,\text{Cov}(Y, Z). \]

PROOF Note that by linearity, $\mu_{aX+bY} = E(aX + bY) = aE(X) + bE(Y) =
a\mu_X + b\mu_Y$. Hence,

\[ \begin{aligned} \text{Cov}(aX + bY, Z) &= E\big((aX + bY - \mu_{aX+bY})(Z - \mu_Z)\big) \\ &= E\big((aX + bY - a\mu_X - b\mu_Y)(Z - \mu_Z)\big) \\ &= E\big((aX - a\mu_X + bY - b\mu_Y)(Z - \mu_Z)\big) \\ &= a\,E\big((X - \mu_X)(Z - \mu_Z)\big) + b\,E\big((Y - \mu_Y)(Z - \mu_Z)\big) \\ &= a\,\text{Cov}(X, Z) + b\,\text{Cov}(Y, Z), \end{aligned} \]

and the result is established. ■


We also have the following identity, which is similar to Theorem 3.3.1(c). 




Theorem 3.3.3 Let X and Y be two random variables. Then

\[ \text{Cov}(X, Y) = E(XY) - E(X)E(Y). \]

PROOF Using linearity, we have

\[ \begin{aligned} \text{Cov}(X, Y) &= E\big((X - \mu_X)(Y - \mu_Y)\big) = E\big(XY - \mu_X Y - X\mu_Y + \mu_X\mu_Y\big) \\ &= E(XY) - \mu_X E(Y) - E(X)\mu_Y + \mu_X\mu_Y \\ &= E(XY) - \mu_X\mu_Y - \mu_X\mu_Y + \mu_X\mu_Y = E(XY) - \mu_X\mu_Y. \end{aligned} \] ■

Corollary 3.3.2 If X and Y are independent, then Cov(X, Y) = 0.

PROOF Because X and Y are independent, we know (Theorems 3.1.3 and 3.2.3)
that E(XY) = E(X)E(Y). Hence, the result follows immediately from Theorem 3.3.3. ■


We note that the converse to Corollary 3.3.2 is false, as the following example 
shows. 


EXAMPLE 3.3.10 Covariance 0 Does Not Imply Independence. 
Let X and Y be discrete random variables, with joint probability function $p_{X,Y}$ given
by

\[ p_{X,Y}(x, y) = \begin{cases} 1/4 & x = 3,\ y = 5 \\ 1/4 & x = 4,\ y = 9 \\ 1/4 & x = 7,\ y = 5 \\ 1/4 & x = 6,\ y = 9 \\ 0 & \text{otherwise.} \end{cases} \]

Then E(X) = (3)(1/4) + (4)(1/4) + (7)(1/4) + (6)(1/4) = 5, E(Y) = (5)(1/4) +
(9)(1/4) + (5)(1/4) + (9)(1/4) = 7, and E(XY) = (3)(5)(1/4) + (4)(9)(1/4) +
(7)(5)(1/4) + (6)(9)(1/4) = 35. We obtain $\text{Cov}(X, Y) = E(XY) - E(X)E(Y) =
35 - (5)(7) = 0$.

On the other hand, X and Y are clearly not independent. For example, P(X =
4) > 0 and P(Y = 5) > 0, but P(X = 4, Y = 5) = 0, so $P(X = 4, Y = 5) \ne
P(X = 4)\,P(Y = 5)$. ■
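The two halves of Example 3.3.10 can be checked side by side: the covariance is zero, yet the joint probability at (4, 5) does not equal the product of the marginals. The variable names in this sketch are illustrative.

```python
from fractions import Fraction as F

# Joint probability function of Example 3.3.10.
joint = {(3, 5): F(1, 4), (4, 9): F(1, 4), (7, 5): F(1, 4), (6, 9): F(1, 4)}

e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov_xy = e_xy - e_x * e_y  # Theorem 3.3.3

# Marginal probabilities P(X = 4) and P(Y = 5), and the joint value at (4, 5).
p_x4 = sum(p for (x, y), p in joint.items() if x == 4)
p_y5 = sum(p for (x, y), p in joint.items() if y == 5)
p_x4_y5 = joint.get((4, 5), F(0))

print(cov_xy, p_x4_y5, p_x4 * p_y5)
```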


There is also an important relationship between variance and covariance. 


Theorem 3.3.4
(a) For any random variables X and Y,

\[ \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y). \]

(b) More generally, for any random variables $X_1, \ldots, X_n$,

\[ \text{Var}\Big(\sum_i X_i\Big) = \sum_i \text{Var}(X_i) + 2 \sum_{i<j} \text{Cov}(X_i, X_j). \]




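PROOF We prove part (b) here; part (a) then follows as the special case n = 2.
Note that by linearity, writing $\mu_i = E(X_i)$,

\[ \mu_{\sum_i X_i} = E\Big(\sum_i X_i\Big) = \sum_i E(X_i) = \sum_i \mu_i. \]

Therefore, we have that

\[ \begin{aligned} \text{Var}\Big(\sum_i X_i\Big) &= E\Big(\Big(\sum_i X_i - \sum_i \mu_i\Big)^2\Big) = E\Big(\Big(\sum_i (X_i - \mu_i)\Big)^2\Big) \\ &= E\Big(\sum_i (X_i - \mu_i) \sum_j (X_j - \mu_j)\Big) = \sum_{i,j} E\big((X_i - \mu_i)(X_j - \mu_j)\big) \\ &= \sum_{i=j} E\big((X_i - \mu_i)(X_j - \mu_j)\big) + 2 \sum_{i<j} E\big((X_i - \mu_i)(X_j - \mu_j)\big) \\ &= \sum_i \text{Var}(X_i) + 2 \sum_{i<j} \text{Cov}(X_i, X_j). \end{aligned} \] ■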


Combining Theorem 3.3.4 with Corollary 3.3.2, we obtain the following. 


Corollary 3.3.3 
(a) If X and Y are independent, then Var(X + Y) = Var(X) + Var(Y).

(b) If $X_1, \ldots, X_n$ are independent, then $\text{Var}\big(\sum_{i=1}^n X_i\big) = \sum_{i=1}^n \text{Var}(X_i)$.


One use of Corollary 3.3.3 is the following. 
EXAMPLE 3.3.11 
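Let $Y \sim \text{Binomial}(n, \theta)$. What is Var(Y)? Recall that we can write

\[ Y = X_1 + X_2 + \cdots + X_n, \]

where the $X_i$ are independent, with $X_i \sim \text{Bernoulli}(\theta)$. We have already seen that
$\text{Var}(X_i) = \theta(1 - \theta)$. Hence, from Corollary 3.3.3,

\[ \begin{aligned} \text{Var}(Y) &= \text{Var}(X_1) + \text{Var}(X_2) + \cdots + \text{Var}(X_n) \\ &= \theta(1 - \theta) + \theta(1 - \theta) + \cdots + \theta(1 - \theta) = n\theta(1 - \theta). \end{aligned} \] ■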
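The formula of Example 3.3.11 can be compared with a direct computation from the binomial probability function. The values n = 10 and θ = 3/10 below are arbitrary illustrative choices, and the function names are not from the text.

```python
from fractions import Fraction as F
from math import comb

def binomial_pf(n, theta):
    """Probability function of Binomial(n, theta) as {k: P(Y = k)}."""
    return {k: comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)}

def variance(pf):
    """Var(Y) = E((Y - mu_Y)^2), computed from the definition."""
    mu = sum(k * p for k, p in pf.items())
    return sum((k - mu) ** 2 * p for k, p in pf.items())

n, theta = 10, F(3, 10)  # illustrative parameter values
print(variance(binomial_pf(n, theta)), n * theta * (1 - theta))
```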


Another concept very closely related to covariance is correlation. 




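Definition 3.3.4 The correlation of two random variables X and Y is given by

\[ \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\text{Sd}(X)\,\text{Sd}(Y)} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}, \]

provided $0 < \text{Var}(X) < \infty$ and $0 < \text{Var}(Y) < \infty$.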


EXAMPLE 3.3.12 

As in Example 3.3.9, let X be any random variable with Var(X) > 0, let Y = 3X, and
let $Z = -4X$. Then Cov(X, Y) = 3 Var(X) and $\text{Cov}(X, Z) = -4\,\text{Var}(X)$. But by
Corollary 3.3.1, Sd(Y) = 3 Sd(X) and Sd(Z) = 4 Sd(X). Hence,

\[ \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\text{Sd}(X)\,\text{Sd}(Y)} = \frac{3\,\text{Var}(X)}{\text{Sd}(X)\,3\,\text{Sd}(X)} = \frac{\text{Var}(X)}{\text{Sd}(X)^2} = 1, \]

because $\text{Sd}(X)^2 = \text{Var}(X)$. Also, we have that

\[ \text{Corr}(X, Z) = \frac{\text{Cov}(X, Z)}{\text{Sd}(X)\,\text{Sd}(Z)} = \frac{-4\,\text{Var}(X)}{\text{Sd}(X)\,4\,\text{Sd}(X)} = -\frac{\text{Var}(X)}{\text{Sd}(X)^2} = -1. \]

Intuitively, this again says that Y increases when X increases, whereas Z decreases
when X increases. However, note that the scale factors 3 and $-4$ have cancelled out;
only their signs were important. ■

We shall see later, in Section 3.6, that we always have $-1 \le \text{Corr}(X, Y) \le 1$, for
any random variables X and Y. Hence, in Example 3.3.12, Y has the largest possible
correlation with X (which makes sense because Y increases whenever X does, without
exception), while Z has the smallest possible correlation with X (which makes sense
because Z decreases whenever X increases). We will also see that Corr(X, Y) is a measure
of the extent to which a linear relationship exists between X and Y.


EXAMPLE 3.3.13 The Bivariate Normal$(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$ Distribution
We defined this distribution in Example 2.7.9. It turns out that when (X, Y) follows this
joint distribution, then (from Problem 2.7.13) $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$.
Further, we have that (see Problem 3.3.17) $\text{Corr}(X, Y) = \rho$. In the following graphs,
we have plotted samples of n = 1000 values of (X, Y) from bivariate normal distributions
with $\mu_1 = \mu_2 = 0$, $\sigma_1^2 = \sigma_2^2 = 1$, and various values of $\rho$. Note that we used
(2.7.1) to generate these samples.

From these plots we can see the effect of $\rho$ on the joint distribution. Figure 3.3.2
shows that when $\rho = 0$, the point cloud is roughly circular. It becomes elliptical in
Figure 3.3.3 with $\rho = 0.5$, and more tightly concentrated about a line in Figure 3.3.4
with $\rho = 0.9$. As we will see in Section 3.6, the points will lie exactly on a line when
$\rho = 1$.

Figure 3.3.5 demonstrates the effect of a negative correlation. With positive correlations,
the value of Y tends to increase with X, as reflected in the upward slope of the
point cloud. With negative correlations, Y tends to decrease with X, as reflected in the
negative slope of the point cloud. ■
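Samples like those in the figures can be generated and their correlations estimated with a short sketch. Formula (2.7.1) is not reproduced in this section, so the code assumes a common construction that may differ from it: take independent Z1, Z2 ~ N(0, 1) and set X = μ1 + σ1 Z1 and Y = μ2 + σ2(ρ Z1 + sqrt(1 − ρ²) Z2), which gives Corr(X, Y) = ρ. The function names `sample_bivariate_normal` and `sample_corr` and the fixed seed are illustrative choices.

```python
import math
import random

def sample_bivariate_normal(n, mu1, mu2, sigma1, sigma2, rho, seed=0):
    """Draw n pairs (x, y) using the assumed construction described above."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mu1 + sigma1 * z1
        y = mu2 + sigma2 * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        pts.append((x, y))
    return pts

def sample_corr(pts):
    """Sample correlation: sample covariance over the product of sample standard deviations."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    return sxy / math.sqrt(sxx * syy)

# The four parameter settings used in Figures 3.3.2 through 3.3.5.
for rho in (0.0, 0.5, 0.9, -0.9):
    pts = sample_bivariate_normal(1000, 0, 0, 1, 1, rho)
    print(rho, round(sample_corr(pts), 3))
```

With n = 1000 the sample correlation should land close to ρ, though not exactly on it, since it is itself a random quantity.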


Figure 3.3.2: A sample of n = 1000 values (X, Y) from the Bivariate Normal (0, 0, 1, 1, 0)
distribution.

Figure 3.3.3: A sample of n = 1000 values (X, Y) from the Bivariate Normal
(0, 0, 1, 1, 0.5) distribution.

Figure 3.3.4: A sample of n = 1000 values (X, Y) from the Bivariate Normal
(0, 0, 1, 1, 0.9) distribution.

Figure 3.3.5: A sample of n = 1000 values (X, Y) from the Bivariate Normal
(0, 0, 1, 1, $-0.9$) distribution.


Summary of Section 3.3 


• The variance of a random variable X measures how far it tends to be from its
mean and is given by $\text{Var}(X) = E((X - \mu_X)^2) = E(X^2) - (E(X))^2$.

• The variances of many standard distributions were computed.

• The standard deviation of X equals $\text{Sd}(X) = \sqrt{\text{Var}(X)}$.

• $\text{Var}(X) \ge 0$, and $\text{Var}(aX + b) = a^2 \text{Var}(X)$; also $\text{Sd}(aX + b) = |a|\, \text{Sd}(X)$.

• The covariance of random variables X and Y measures how they are related and
is given by $\text{Cov}(X, Y) = E((X - \mu_X)(Y - \mu_Y)) = E(XY) - E(X)E(Y)$.

• If X and Y are independent, then Cov(X, Y) = 0.

• Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y). If X and Y are independent,
this equals Var(X) + Var(Y).

• The correlation of X and Y is Corr(X, Y) = Cov(X, Y)/(Sd(X) Sd(Y)).
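As a quick check of these formulas, the following sketch computes the variance, covariance, and correlation from a small joint probability function using exact fractions. The probability function here is hypothetical, chosen only for illustration, and the helper `expect` is not notation from the text.

```python
from fractions import Fraction as F
import math

# Hypothetical joint probability function p(x, y), chosen only for illustration.
pmf = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 0): F(1, 8), (1, 1): F(3, 8)}

def expect(g):
    """E(g(X, Y)) computed directly from the joint probability function."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
var_x = expect(lambda x, y: x * x) - ex ** 2      # Var(X) = E(X^2) - (E(X))^2
var_y = expect(lambda x, y: y * y) - ey ** 2
cov = expect(lambda x, y: x * y) - ex * ey        # Cov(X, Y) = E(XY) - E(X)E(Y)
var_sum = expect(lambda x, y: (x + y) ** 2) - (ex + ey) ** 2
corr = float(cov) / math.sqrt(float(var_x) * float(var_y))

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) holds exactly.
assert var_sum == var_x + var_y + 2 * cov
print(var_x, var_y, cov, corr)
```

Working with `Fraction` keeps the identity for Var(X + Y) exact rather than approximate.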


EXERCISES 
3.3.1 Suppose the joint probability function of X and Y is given by
\[
p_{X,Y}(x, y) =
\begin{cases}
1/2 & x = 3,\ y = 5 \\
1/6 & x = 3,\ y = 9 \\
1/6 & x = 6,\ y = 5 \\
1/6 & x = 6,\ y = 9 \\
0 & \text{otherwise,}
\end{cases}
\]
with E(X) = 4, E(Y) = 19/3, and E(XY) = 26, as in Example 3.1.16.
(a) Compute Cov(X, Y). 

(b) Compute Var(X) and Var(Y). 

(c) Compute Corr(X, Y). 


3.3.2 Suppose the joint probability function of X and Y is given by
\[
p_{X,Y}(x, y) =
\begin{cases}
1/7 & x = 5,\ y = 0 \\
1/7 & x = 5,\ y = 3 \\
1/7 & x = 5,\ y = 4 \\
3/7 & x = 8,\ y = 0 \\
1/7 & x = 8,\ y = 4 \\
0 & \text{otherwise,}
\end{cases}
\]
as in Example 2.7.5.
(a) Compute E(X) and E(Y). 
(b) Compute Cov(X, Y). 
(c) Compute Var(X) and Var(Y). 
(d) Compute Corr(X, Y). 
3.3.3 Let X and Y have joint density
\[
f_{X,Y}(x, y) =
\begin{cases}
4x^2 y + 2y^5 & 0 \le x \le 1,\ 0 \le y \le 1 \\
0 & \text{otherwise,}
\end{cases}
\]
as in Exercise 3.2.2. Compute Corr(X, Y).
3.3.4 Let X and Y have joint density
\[
f_{X,Y}(x, y) =
\begin{cases}
15x^3 y^4 + 6x^2 y^7 & 0 \le x \le 1,\ 0 \le y \le 1 \\
0 & \text{otherwise.}
\end{cases}
\]
Compute E(X), E(Y), Var(X), Var(Y), Cov(X, Y), and Corr(X, Y).

3.3.5 Let Y and Z be two independent random variables, each with positive variance. 
Prove that Corr(Y, Z) = 0. 

3.3.6 Let X, Y, and Z be three random variables, and suppose that X and Z are inde- 
pendent. Prove that Cov(X + Y, Z) =Cov(Y, Z). 

3.3.7 Let X ~ Exponential (3) and Y ~ Poisson(5). Assume X and Y are independent. 
Let Z = X +Y. 

(a) Compute Cov(X, Z). 

(b) Compute Corr(X, Z). 

3.3.8 Prove that the variance of the Uniform[L, R] distribution is given by the expression
$(R - L)^2/12$.

3.3.9 Prove that Var(X) = E(X(X - 1)) - E(X)E(X - 1). Use this to compute
directly from the probability function that when $X \sim \text{Binomial}(n, \theta)$, then
$\text{Var}(X) = n\theta(1 - \theta)$.

3.3.10 Suppose you flip three fair coins. Let X be the number of heads showing, and
let $Y = X^2$. Compute E(X), E(Y), Var(X), Var(Y), Cov(X, Y), and Corr(X, Y).
3.3.11 Suppose you roll two fair six-sided dice. Let X be the number showing on the 
first die, and let Y be the sum of the numbers showing on the two dice. Compute E (X), 
E(Y), E(XY), and Cov(X, Y). 

3.3.12 Suppose you flip four fair coins. Let X be the number of heads showing, and 
let Y be the number of tails showing. Compute Cov(X, Y) and Corr(X, Y). 

3.3.13 Let X and Y be independent, with X ~ Bernoulli(1/2) and Y ~ Bernoulli(1/3).
Let Z = X + Y and W = X — Y. Compute Cov(Z, W) and Corr(Z, W). 

3.3.14 Let X and Y be independent, with X ~ Bernoulli (1/2) and Y ~ N (0,1). Let 
Z = X +Y and W = X — Y. Compute Var(Z), Var(W), Cov(Z, W), and Corr(Z, W). 
3.3.15 Suppose you roll one fair six-sided die and then flip as many coins as the num- 
ber showing on the die. (For example, if the die shows 4, then you flip four coins.) Let 
X be the number showing on the die, and Y be the number of heads obtained. Compute 
Cov(X, Y). 


PROBLEMS 


3.3.16 Let $X \sim N(0, 1)$, and let $Y = cX$.

(a) Compute $\lim_{c \searrow 0} \text{Cov}(X, Y)$.

(b) Compute $\lim_{c \nearrow 0} \text{Cov}(X, Y)$.

(c) Compute $\lim_{c \searrow 0} \text{Corr}(X, Y)$.

(d) Compute $\lim_{c \nearrow 0} \text{Corr}(X, Y)$.

(e) Explain why the answers in parts (c) and (d) are not the same.




3.3.17 Let X and Y have the bivariate normal distribution, as in Example 2.7.9. Prove
that Corr(X, Y) = $\rho$. (Hint: Use (2.7.1).)

3.3.18 Prove that the variance of the Geometric($\theta$) distribution is given by $(1 - \theta)/\theta^2$.
(Hint: Use Exercise 3.3.9 and $\frac{d^2}{d\theta^2}(1 - \theta)^x = x(x - 1)(1 - \theta)^{x-2}$.)

3.3.19 Prove that the variance of the Negative-Binomial($r, \theta$) distribution is given by
$r(1 - \theta)/\theta^2$. (Hint: Use Problem 3.3.18.)


3.3.20 Let $\alpha > 0$ and $\lambda > 0$, and let $X \sim \text{Gamma}(\alpha, \lambda)$. Prove that $\text{Var}(X) = \alpha/\lambda^2$.
(Hint: Recall Problem 3.2.16.) 

3.3.21 Suppose that $X \sim \text{Weibull}(\alpha)$ (see Problem 2.4.19). Prove that
$\text{Var}(X) = \Gamma(2/\alpha + 1) - \Gamma^2(1/\alpha + 1)$. (Hint: Recall Problem 3.2.18.)

3.3.22 Suppose that $X \sim \text{Pareto}(\alpha)$ (see Problem 2.4.20) for $\alpha > 2$. Prove that
$\text{Var}(X) = \alpha/((\alpha - 1)^2 (\alpha - 2))$. (Hint: Recall Problem 3.2.19.)


3.3.23 Suppose that X follows the Laplace distribution (see Problem 2.4.22). Prove 
that Var(X) = 2. (Hint: Recall Problem 3.2.21.) 


3.3.24 Suppose that $X \sim \text{Beta}(a, b)$ (see Problem 2.4.24). Prove that
$\text{Var}(X) = ab/((a + b)^2 (a + b + 1))$. (Hint: Recall Problem 3.2.22.)


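3.3.25 Suppose that $(X_1, X_2, X_3) \sim \text{Multinomial}(n, \theta_1, \theta_2, \theta_3)$. Prove that
\[
\text{Var}(X_i) = n\theta_i(1 - \theta_i), \qquad \text{Cov}(X_i, X_j) = -n\theta_i\theta_j \ \text{ when } i \ne j.
\]
(Hint: Recall Problem 3.1.23.)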
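3.3.26 Suppose that $(X_1, X_2) \sim \text{Dirichlet}(\alpha_1, \alpha_2, \alpha_3)$ (see Problem 2.7.17). Prove
that
\[
\begin{aligned}
\text{Var}(X_i) &= \frac{\alpha_i(\alpha_1 + \alpha_2 + \alpha_3 - \alpha_i)}{(\alpha_1 + \alpha_2 + \alpha_3)^2 (\alpha_1 + \alpha_2 + \alpha_3 + 1)}, \\
\text{Cov}(X_1, X_2) &= \frac{-\alpha_1\alpha_2}{(\alpha_1 + \alpha_2 + \alpha_3)^2 (\alpha_1 + \alpha_2 + \alpha_3 + 1)}.
\end{aligned}
\]
(Hint: Recall Problem 3.2.23.)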
3.3.27 Suppose that $X \sim \text{Hypergeometric}(N, M, n)$. Prove that
\[
\text{Var}(X) = n \frac{M}{N}\left(1 - \frac{M}{N}\right)\frac{N - n}{N - 1}.
\]
(Hint: Recall Problem 3.1.21 and use Exercise 3.3.9.)

3.3.28 Suppose you roll one fair six-sided die and then flip as many coins as the num- 
ber showing on the die. (For example, if the die shows 4, then you flip four coins.) Let 
X be the number showing on the die, and Y be the number of heads obtained. Compute 
Corr(X, Y). 


CHALLENGES 


3.3.29 Let Y be a nonnegative random variable. Prove that E(Y) = 0 if and only if 
P(Y = 0) = 1. (You may assume for simplicity that Y is discrete, but the result is true 
for any Y.) 




3.3.30 Prove that Var(X) = 0 if and only if there is a real number c with P(X = c) = 
1. (You may use the result of Challenge 3.3.29.) 

3.3.31 Give an example of a random variable X such that E(X) = 5 and $\text{Var}(X) = \infty$.


3.4 | Generating Functions 


Let X be a random variable. Recall that the cumulative distribution function of X,
defined by $F_X(x) = P(X \le x)$, contains all the information about the distribution of
X (see Theorem 2.5.1). It turns out that there are other functions, namely the
probability-generating function and the moment-generating function, that also provide
information (sometimes all the information) about X and its expected values.


Definition 3.4.1 Let X be a random variable (usually discrete). Then we define its
probability-generating function, $r_X$, by $r_X(t) = E(t^X)$ for $t \in R^1$.


Consider the following examples of probability-generating functions. 


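EXAMPLE 3.4.1 The Binomial($n, \theta$) Distribution
If $X \sim \text{Binomial}(n, \theta)$, then
\[
\begin{aligned}
r_X(t) = E(t^X) &= \sum_{i=0}^{n} P(X = i)\, t^i
= \sum_{i=0}^{n} \binom{n}{i} \theta^i (1 - \theta)^{n-i} t^i \\
&= \sum_{i=0}^{n} \binom{n}{i} (t\theta)^i (1 - \theta)^{n-i} = (t\theta + 1 - \theta)^n,
\end{aligned}
\]
using the binomial theorem. ■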


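EXAMPLE 3.4.2 The Poisson($\lambda$) Distribution
If $Y \sim \text{Poisson}(\lambda)$, then
\[
\begin{aligned}
r_Y(t) = E(t^Y) &= \sum_{i=0}^{\infty} P(Y = i)\, t^i = \sum_{i=0}^{\infty} e^{-\lambda} \frac{\lambda^i}{i!}\, t^i \\
&= \sum_{i=0}^{\infty} e^{-\lambda} \frac{(\lambda t)^i}{i!} = e^{-\lambda} e^{\lambda t} = e^{\lambda(t-1)}. \ ■
\end{aligned}
\]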


The following theorem tells us that once we know the probability-generating function
$r_X(t)$, then we can compute all the probabilities P(X = 0), P(X = 1), P(X = 2),
etc.




Theorem 3.4.1 Let X be a discrete random variable, whose possible values are all
nonnegative integers. Assume that $r_X(t_0) < \infty$ for some $t_0 > 0$. Then
\[
\begin{aligned}
r_X(0) &= P(X = 0), \\
r_X'(0) &= P(X = 1), \\
r_X''(0) &= 2\, P(X = 2),
\end{aligned}
\]
etc. In general,
\[
r_X^{(k)}(0) = k!\, P(X = k),
\]
where $r_X^{(k)}$ is the kth derivative of $r_X$.


PROOF | Because the possible values are all nonnegative integers of the form
i = 0, 1, 2, ..., we have
\[
\begin{aligned}
r_X(t) = E(t^X) &= \sum_{x} t^x P(X = x) = \sum_{i=0}^{\infty} t^i P(X = i) \\
&= t^0 P(X = 0) + t^1 P(X = 1) + t^2 P(X = 2) + t^3 P(X = 3) + \cdots,
\end{aligned}
\]
so that
\[
r_X(t) = 1\, P(X = 0) + t\, P(X = 1) + t^2 P(X = 2) + t^3 P(X = 3) + \cdots. \qquad (3.4.1)
\]
Substituting t = 0 into (3.4.1), every term vanishes except the first one, and we obtain
$r_X(0) = P(X = 0)$. Taking derivatives of both sides of (3.4.1), we obtain
\[
r_X'(t) = 1\, P(X = 1) + 2t\, P(X = 2) + 3t^2 P(X = 3) + \cdots,
\]
and setting t = 0 gives $r_X'(0) = P(X = 1)$. Taking another derivative of both sides
gives
\[
r_X''(t) = 2\, P(X = 2) + 3 \cdot 2\, t\, P(X = 3) + \cdots,
\]
and setting t = 0 gives $r_X''(0) = 2\, P(X = 2)$. Continuing in this way, we obtain the
general formula. ■
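The argument in the proof can be traced concretely for a distribution with finitely many values, where the probability-generating function is a polynomial. The sketch below expands $(\theta t + 1 - \theta)^n$ into its coefficients, differentiates k times, and evaluates at 0. The helper names are illustrative and the parameter values are arbitrary.

```python
from fractions import Fraction
from math import comb, factorial

def binomial_pgf_coeffs(n, theta):
    """Coefficients [a_0, ..., a_n] of r_X(t) = (theta*t + 1 - theta)^n."""
    coeffs = [Fraction(1)]
    for _ in range(n):
        nxt = [Fraction(0)] * (len(coeffs) + 1)
        for i, a in enumerate(coeffs):
            nxt[i] += a * (1 - theta)   # contribution of the constant term 1 - theta
            nxt[i + 1] += a * theta     # contribution of the term theta*t
        coeffs = nxt
    return coeffs

def derivative(coeffs):
    """Coefficients of the derivative of a polynomial."""
    return [i * a for i, a in enumerate(coeffs)][1:]

def kth_derivative_at_zero(coeffs, k):
    """Value at t = 0 of the kth derivative, i.e. its constant term."""
    for _ in range(k):
        coeffs = derivative(coeffs)
    return coeffs[0] if coeffs else Fraction(0)

n, theta = 5, Fraction(1, 3)
r = binomial_pgf_coeffs(n, theta)
for k in range(n + 1):
    pmf_k = comb(n, k) * theta ** k * (1 - theta) ** (n - k)
    # Theorem 3.4.1: the kth derivative at 0 equals k! P(X = k).
    assert kth_derivative_at_zero(r, k) == factorial(k) * pmf_k
```

Exact fractions are used so that the comparison with $k!\,P(X = k)$ is an equality rather than a floating-point approximation.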

We now apply Theorem 3.4.1 to the binomial and Poisson distributions. 
EXAMPLE 3.4.3 The Binomial(n, 0) Distribution 
From Example 3.4.1, we have that 


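\[
\begin{aligned}
r_X(0) &= (1 - \theta)^n, \\
r_X'(0) &= n(t\theta + 1 - \theta)^{n-1}(\theta)\Big|_{t=0} = n(1 - \theta)^{n-1}\theta, \\
r_X''(0) &= n(n - 1)(t\theta + 1 - \theta)^{n-2}(\theta)(\theta)\Big|_{t=0} = n(n - 1)(1 - \theta)^{n-2}\theta^2,
\end{aligned}
\]
etc. It is thus verified directly that
\[
\begin{aligned}
P(X = 0) &= r_X(0), \\
P(X = 1) &= r_X'(0), \\
2\, P(X = 2) &= r_X''(0),
\end{aligned}
\]
etc. ■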


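EXAMPLE 3.4.4 The Poisson($\lambda$) Distribution
From Example 3.4.2, we have that
\[
\begin{aligned}
r_X(0) &= e^{-\lambda}, \\
r_X'(0) &= \lambda e^{-\lambda}, \\
r_X''(0) &= \lambda^2 e^{-\lambda},
\end{aligned}
\]
etc. It is again verified that
\[
\begin{aligned}
P(X = 0) &= r_X(0), \\
P(X = 1) &= r_X'(0), \\
2\, P(X = 2) &= r_X''(0),
\end{aligned}
\]
etc. ■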


From Theorem 3.4.1, we can see why $r_X$ is called the probability-generating function.
For, at least in the discrete case with the distribution concentrated on the nonnegative
integers, we can indeed generate the probabilities for X from $r_X$. We thus
see immediately that for a random variable X that takes values only in {0, 1, 2, ...},
$r_X$ is unique. By this we mean that if X and Y are concentrated on {0, 1, 2, ...} and
$r_X = r_Y$, then X and Y have the same distribution. This uniqueness property of the
probability-generating function can be very useful in trying to determine the distribution
of a random variable that takes only values in {0, 1, 2, ...}.

It is clear that the probability-generating function tells us a lot — in fact, everything 
— about the distribution of random variables concentrated on the nonnegative integers. 
But what about other random variables? It turns out that there are other quantities, 
called moments, associated with random variables that are quite informative about their 
distributions. 


Definition 3.4.2 Let X be a random variable, and let k be a positive integer. Then
the kth moment of X is the quantity $E(X^k)$, provided this expectation exists.

Note that if $E(X^k)$ exists and is finite, it can be shown that $E(X^l)$ exists and is finite
when $0 \le l < k$.

The first moment is just the mean of the random variable. This can be taken as
a measure of where the central mass of probability for X lies in the real line, at least
when this distribution is unimodal (has a single peak) and is not too highly skewed. The
second moment $E(X^2)$, together with the first moment, gives us the variance through
$\text{Var}(X) = E(X^2) - (E(X))^2$. Therefore, the first two moments of the distribution tell
us about the location of the distribution and the spread, or degree of concentration, of
that distribution about the mean. In fact, the higher moments also provide information
about the distribution.

Many of the most important distributions of probability and statistics have all of 
their moments finite; in fact, they have what is called a moment-generating function. 




Definition 3.4.3 Let X be any random variable. Then its moment-generating function
$m_X$ is defined by $m_X(s) = E(e^{sX})$ at $s \in R^1$.


The following example computes the moment-generating function of a well-known 
distribution. 


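EXAMPLE 3.4.5 The Exponential($\lambda$) Distribution
Let $X \sim \text{Exponential}(\lambda)$. Then for $s < \lambda$,
\[
\begin{aligned}
m_X(s) = E(e^{sX}) &= \int_{-\infty}^{\infty} e^{sx} f_X(x)\, dx = \int_0^{\infty} e^{sx} \lambda e^{-\lambda x}\, dx \\
&= \int_0^{\infty} \lambda e^{(s - \lambda)x}\, dx = \left. \frac{\lambda e^{(s - \lambda)x}}{s - \lambda} \right|_{x=0}^{x=\infty}
= -\frac{\lambda e^{(s - \lambda)0}}{s - \lambda} \\
&= -\frac{\lambda}{s - \lambda} = \lambda(\lambda - s)^{-1}. \ ■
\end{aligned}
\]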
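The closed form $\lambda(\lambda - s)^{-1}$ can be compared against a direct numerical evaluation of the defining integral. The following is a minimal sketch using a midpoint rule on a truncated range; the truncation point `upper`, the number of steps, and the function name are assumptions made for illustration.

```python
import math

def exponential_mgf_numeric(s, lam, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of e^{sx} * lam * e^{-lam*x} over [0, upper]."""
    if s >= lam:
        raise ValueError("the integral diverges unless s < lam")
    h = upper / steps
    total = 0.0
    for j in range(steps):
        x = (j + 0.5) * h                       # midpoint of the jth subinterval
        total += lam * math.exp((s - lam) * x)  # integrand lam * e^{(s - lam) x}
    return total * h

lam = 2.0
for s in (-1.0, 0.0, 0.5, 1.0):
    # Numerical value next to the closed form lam / (lam - s).
    print(s, exponential_mgf_numeric(s, lam), lam / (lam - s))
```

The truncation is harmless here only because the integrand decays exponentially when $s < \lambda$; for s close to $\lambda$ a larger `upper` would be needed.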


A comparison of Definitions 3.4.1 and 3.4.3 immediately gives the following. 


Theorem 3.4.2 Let X be any random variable. Then $m_X(s) = r_X(e^s)$.


This result can obviously help us evaluate some moment-generating functions when
we have $r_X$ already.


EXAMPLE 3.4.6
Let $Y \sim \text{Binomial}(n, \theta)$. Then we know that $r_Y(t) = (t\theta + 1 - \theta)^n$. Hence,
$m_Y(s) = r_Y(e^s) = (e^s\theta + 1 - \theta)^n$. ■

EXAMPLE 3.4.7
Let $Z \sim \text{Poisson}(\lambda)$. Then we know that $r_Z(t) = e^{\lambda(t-1)}$. Hence,
$m_Z(s) = r_Z(e^s) = e^{\lambda(e^s - 1)}$. ■


The following theorem tells us that once we know the moment-generating function
$m_X(s)$, we can compute all the moments $E(X)$, $E(X^2)$, $E(X^3)$, etc.


Theorem 3.4.3 Let X be any random variable. Suppose that for some $s_0 > 0$, it is
true that $m_X(s) < \infty$ whenever $s \in (-s_0, s_0)$. Then
\[
\begin{aligned}
m_X(0) &= 1, \\
m_X'(0) &= E(X), \\
m_X''(0) &= E(X^2),
\end{aligned}
\]
etc. In general,
\[
m_X^{(k)}(0) = E(X^k),
\]
where $m_X^{(k)}$ is the kth derivative of $m_X$.


PROOF | We know that $m_X(s) = E(e^{sX})$. We have
\[
m_X(0) = E(e^{0X}) = E(e^0) = E(1) = 1.
\]
Also, taking derivatives, we see (footnote 3) that $m_X'(s) = E(X e^{sX})$, so
\[
m_X'(0) = E(X e^{0X}) = E(X e^0) = E(X).
\]
Taking derivatives again, we see that $m_X''(s) = E(X^2 e^{sX})$, so
\[
m_X''(0) = E(X^2 e^{0X}) = E(X^2 e^0) = E(X^2).
\]
Continuing in this way, we obtain the general formula. ■


We now consider an application of Theorem 3.4.3. 


EXAMPLE 3.4.8 The Mean and Variance of the Exponential($\lambda$) Distribution
Using the moment-generating function computed in Example 3.4.5, we have
\[
m_X'(s) = (-1)\lambda(\lambda - s)^{-2}(-1) = \lambda(\lambda - s)^{-2}.
\]
Therefore,
\[
E(X) = m_X'(0) = \lambda(\lambda - 0)^{-2} = \lambda/\lambda^2 = 1/\lambda,
\]
as it should. Also,
\[
E(X^2) = m_X''(0) = (-2)\lambda(\lambda - 0)^{-3}(-1) = 2\lambda/\lambda^3 = 2/\lambda^2,
\]
so we have
\[
\text{Var}(X) = E(X^2) - (E(X))^2 = (2/\lambda^2) - (1/\lambda)^2 = 1/\lambda^2.
\]
This provides an easy way of computing the variance of X. ■
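When a moment-generating function is awkward to differentiate by hand, the derivatives at 0 in Theorem 3.4.3 can be approximated numerically. The sketch below uses central finite differences, which is a swapped-in numerical technique rather than anything used in the text; the step size `h` and the helper name are assumptions.

```python
import math

def first_two_moments(mgf, h=1e-3):
    """Approximate m'(0) and m''(0) by central finite differences."""
    m_plus, m_zero, m_minus = mgf(h), mgf(0.0), mgf(-h)
    first = (m_plus - m_minus) / (2.0 * h)
    second = (m_plus - 2.0 * m_zero + m_minus) / (h * h)
    return first, second

lam = 2.0
exp_mgf = lambda s: lam / (lam - s)                          # Example 3.4.5
poisson_mgf = lambda s: math.exp(lam * (math.exp(s) - 1.0))  # Example 3.4.7

for name, mgf in (("Exponential", exp_mgf), ("Poisson", poisson_mgf)):
    mean, second = first_two_moments(mgf)
    # The variance is the second moment minus the squared mean.
    print(name, "mean ~", mean, "variance ~", second - mean ** 2)
```

For $\lambda = 2$ the approximations should be close to mean 1/2 and variance 1/4 for the exponential case, and to mean 2 and variance 2 for the Poisson case.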


EXAMPLE 3.4.9 The Mean and Variance of the Poisson($\lambda$) Distribution
In Example 3.4.7, we obtained $m_Z(s) = \exp(\lambda(e^s - 1))$. So we have
\[
\begin{aligned}
E(Z) &= m_Z'(0) = \lambda e^0 \exp(\lambda(e^0 - 1)) = \lambda, \\
E(Z^2) &= m_Z''(0) = \lambda e^0 \exp(\lambda(e^0 - 1)) + (\lambda e^0)^2 \exp(\lambda(e^0 - 1)) = \lambda + \lambda^2.
\end{aligned}
\]
Therefore, $\text{Var}(Z) = E(Z^2) - (E(Z))^2 = \lambda + \lambda^2 - \lambda^2 = \lambda$. ■

Computing the moment-generating function of a normal distribution is also important,
but it is somewhat more difficult.


Theorem 3.4.4 If $X \sim N(0, 1)$, then $m_X(s) = e^{s^2/2}$.


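PROOF | Because X has density $\phi(x) = (2\pi)^{-1/2} e^{-x^2/2}$, we have that
\[
\begin{aligned}
m_X(s) = E(e^{sX}) &= \int_{-\infty}^{\infty} e^{sx} \phi(x)\, dx = \int_{-\infty}^{\infty} e^{sx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{sx - (x^2/2)}\, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(x - s)^2/2 + (s^2/2)}\, dx \\
&= e^{s^2/2} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(x - s)^2/2}\, dx.
\end{aligned}
\]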

3 Strictly speaking, interchanging the order of derivative and expectation is justified by analytic function
theory and requires that $m_X(s) < \infty$ whenever $|s| < s_0$.




Setting y = x - s (so that dy = dx), this becomes (using Theorem 2.4.2)
\[
m_X(s) = e^{s^2/2} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\, dy = e^{s^2/2} \int_{-\infty}^{\infty} \phi(y)\, dy = e^{s^2/2},
\]
as claimed. ■
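Theorem 3.4.4 can also be checked by simulation: averaging $e^{sX}$ over many simulated N(0, 1) values should give something close to $e^{s^2/2}$. This is only a Monte Carlo sketch; the seed, sample size, and function name are arbitrary choices, and the agreement is approximate.

```python
import math
import random

def normal_mgf_monte_carlo(s, n, rng):
    """Estimate E(e^{sX}) for X ~ N(0, 1) by averaging over n simulated draws."""
    return sum(math.exp(s * rng.gauss(0.0, 1.0)) for _ in range(n)) / n

rng = random.Random(2024)
for s in (-1.0, 0.5, 1.0):
    estimate = normal_mgf_monte_carlo(s, 200_000, rng)
    # Simulated average next to the exact value e^{s^2/2}.
    print(s, round(estimate, 4), round(math.exp(s * s / 2.0), 4))
```

The estimate becomes noisier as |s| grows, because $e^{sX}$ then has a much larger variance.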


One useful property of both probability-generating and moment-generating func- 
tions is the following. 


Theorem 3.4.5 Let X and Y be random variables that are independent. Then we
have
(a) $r_{X+Y}(t) = r_X(t)\, r_Y(t)$, and
(b) $m_{X+Y}(t) = m_X(t)\, m_Y(t)$.


PROOF | Because X and Y are independent, so are $t^X$ and $t^Y$ (by Theorem 2.8.5).
Hence, we know (by Theorems 3.1.3 and 3.2.3) that $E(t^X t^Y) = E(t^X) E(t^Y)$. Using
this, we have
\[
r_{X+Y}(t) = E(t^{X+Y}) = E(t^X t^Y) = E(t^X) E(t^Y) = r_X(t)\, r_Y(t).
\]
Similarly,
\[
m_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX} e^{tY}) = E(e^{tX}) E(e^{tY}) = m_X(t)\, m_Y(t). \ ■
\]


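EXAMPLE 3.4.10
Let $Y \sim \text{Binomial}(n, \theta)$. Then, as in Example 3.1.15, we can write
\[
Y = X_1 + \cdots + X_n,
\]
where the $\{X_i\}$ are i.i.d. with $X_i \sim \text{Bernoulli}(\theta)$. Hence, Theorem 3.4.5 says we must
have $r_Y(t) = r_{X_1}(t)\, r_{X_2}(t) \cdots r_{X_n}(t)$. But for any i,
\[
r_{X_i}(t) = \sum_{x} t^x P(X_i = x) = t^1 \theta + t^0 (1 - \theta) = \theta t + 1 - \theta.
\]
Hence, we must have
\[
r_Y(t) = (\theta t + 1 - \theta)(\theta t + 1 - \theta) \cdots (\theta t + 1 - \theta) = (\theta t + 1 - \theta)^n,
\]
as already verified in Example 3.4.1. ■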


Moment-generating functions, when defined in a neighborhood of 0, completely 
define a distribution in the following sense. (We omit the proof, which is advanced.) 


Theorem 3.4.6 (Uniqueness theorem) Let X be a random variable, such that for
some $s_0 > 0$, we have $m_X(s) < \infty$ whenever $s \in (-s_0, s_0)$. Then if Y is some
other random variable with $m_Y(s) = m_X(s)$ whenever $s \in (-s_0, s_0)$, then X and Y
have the same distribution.




Theorems 3.4.1 and 3.4.6 provide a powerful technique for identifying distributions.
For example, if we determine that the moment-generating function of X is
$m_X(s) = \exp(s^2/2)$, then we know, from Theorems 3.4.4 and 3.4.6, that
$X \sim N(0, 1)$. We can use this approach to determine the distributions of some complicated
random variables.


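EXAMPLE 3.4.11
Suppose that $X_i \sim N(\mu_i, \sigma_i^2)$ for i = 1, ..., n and that these random variables are
independent. Consider the distribution of $Y = \sum_{i=1}^{n} X_i$.

When n = 1 we have (from Problem 3.4.15)
\[
m_Y(s) = \exp\left\{ \mu_1 s + \frac{\sigma_1^2 s^2}{2} \right\}.
\]
Then, using Theorem 3.4.5, we have that
\[
m_Y(s) = \prod_{i=1}^{n} m_{X_i}(s) = \prod_{i=1}^{n} \exp\left\{ \mu_i s + \frac{\sigma_i^2 s^2}{2} \right\}
= \exp\left\{ \left( \sum_{i=1}^{n} \mu_i \right) s + \frac{\left( \sum_{i=1}^{n} \sigma_i^2 \right) s^2}{2} \right\}.
\]
From Problem 3.4.15, and applying Theorem 3.4.6, we have that
\[
Y \sim N\left( \sum_{i=1}^{n} \mu_i,\ \sum_{i=1}^{n} \sigma_i^2 \right). \ ■
\]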
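The conclusion of this example is easy to probe by simulation: the sample mean and sample variance of many simulated sums should be close to $\sum \mu_i$ and $\sum \sigma_i^2$. The sketch below uses hypothetical parameter values; it checks only the first two moments, not the full normality of Y.

```python
import random

# Hypothetical means and standard deviations for three independent normals.
mus = (1.0, -2.0, 0.5)
sigmas = (1.0, 2.0, 0.5)

def draw_sum(rng):
    """One draw of Y = X_1 + X_2 + X_3 with independent X_i ~ N(mu_i, sigma_i^2)."""
    return sum(rng.gauss(mu, sigma) for mu, sigma in zip(mus, sigmas))

rng = random.Random(11)
ys = [draw_sum(rng) for _ in range(50_000)]

mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / (len(ys) - 1)

print(mean_y, sum(mus))                     # sample mean vs. sum of the means
print(var_y, sum(s ** 2 for s in sigmas))   # sample variance vs. sum of the variances
```

Note that `random.gauss` takes the standard deviation, not the variance, as its second argument.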


Generating functions can also help us with compound distributions, which are defined
as follows.


Definition 3.4.4 Let $X_1, X_2, \ldots$ be i.i.d., and let N be a nonnegative, integer-valued
random variable which is independent of the $\{X_i\}$. Let
\[
S = \sum_{i=1}^{N} X_i. \qquad (3.4.2)
\]
Then S is said to have a compound distribution.


A compound distribution is obtained from a sum of i.i.d. random variables, where
the number of terms in the sum is randomly distributed independently of the terms in
the sum. Note that S = 0 when N = 0. Such distributions have applications in areas
like insurance, where the $X_1, X_2, \ldots$ are claims and N is the number of claims
presented to an insurance company during a period. Therefore, S represents the total
amount claimed against the insurance company during the period. Obviously, the
insurance company wants to study the distribution of S, as this will help determine what
it has to charge for insurance to ensure a profit.

The following theorem is important in the study of compound distributions. 




Theorem 3.4.7 If S has a compound distribution as in (3.4.2), then
(a) $E(S) = E(X_1)\, E(N)$.
(b) $m_S(s) = r_N(m_{X_1}(s))$.


PROOF | See Section 3.8 for the proof of this result. ■
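Both parts of Theorem 3.4.7 can be illustrated by simulating the insurance setting described above. The sketch below assumes N ~ Poisson($\mu$) and $X_i$ ~ Exponential($\lambda$) with hypothetical parameter values, so that $r_N(t) = e^{\mu(t-1)}$ and $m_{X_1}(s) = \lambda/(\lambda - s)$. The Poisson sampler is Knuth's multiplication method, included only because the standard library has no Poisson generator.

```python
import math
import random

def poisson_draw(mu, rng):
    """Knuth's multiplication method for a Poisson(mu) draw (adequate for small mu)."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def compound_draw(mu, lam, rng):
    """One draw of S = X_1 + ... + X_N; the empty sum gives S = 0 when N = 0."""
    n_claims = poisson_draw(mu, rng)
    return sum(rng.expovariate(lam) for _ in range(n_claims))

rng = random.Random(42)
mu, lam, s = 3.0, 2.0, 0.5
draws = [compound_draw(mu, lam, rng) for _ in range(100_000)]

mean_s = sum(draws) / len(draws)
mgf_s = sum(math.exp(s * d) for d in draws) / len(draws)
print(mean_s, (1.0 / lam) * mu)                       # part (a): E(X_1) E(N)
print(mgf_s, math.exp(mu * (lam / (lam - s) - 1.0)))  # part (b): r_N(m_{X_1}(s))
```

The simulated values agree with the formulas only up to Monte Carlo error, which is larger for the moment-generating function than for the mean.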


3.4.1 | Characteristic Functions (Advanced) 


One problem with moment-generating functions is that they can be infinite in any open 
interval about s = 0. Consider the following example. 


EXAMPLE 3.4.12
Let X be a random variable having density
\[
f_X(x) =
\begin{cases}
1/x^2 & x \ge 1 \\
0 & \text{otherwise.}
\end{cases}
\]
Then
\[
m_X(s) = E(e^{sX}) = \int_1^{\infty} e^{sx} (1/x^2)\, dx.
\]
For any s > 0, we know that $e^{sx}$ grows faster than $x^2$, so that $\lim_{x \to \infty} e^{sx}/x^2 = \infty$.
Hence, $m_X(s) = \infty$ whenever s > 0.
Does X have any finite moments? We have that
\[
E(X) = \int_1^{\infty} x (1/x^2)\, dx = \int_1^{\infty} (1/x)\, dx = \ln x \Big|_{x=1}^{x=\infty} = \infty,
\]
so, in fact, the first moment does not exist. From this we conclude that X does not have
any moments. ■


The random variable X in the above example does not satisfy the condition of
Theorem 3.4.3 that $m_X(s) < \infty$ whenever $|s| < s_0$, for some $s_0 > 0$. Hence,
Theorem 3.4.3 (like most other theorems that make use of moment-generating functions)
does not apply. There is, however, a similarly defined function that does not suffer
from this defect, given by the following definition.


Definition 3.4.5 Let X be any random variable. Then we define its characteristic
function, $c_X$, by
\[
c_X(s) = E(e^{isX}) \qquad (3.4.3)
\]
for $s \in R^1$.


So the definition of $c_X$ is just like the definition of $m_X$, except for the introduction
of the imaginary number $i = \sqrt{-1}$. Using properties of complex numbers, we see
that (3.4.3) can also be written as $c_X(s) = E(\cos(sX)) + i\, E(\sin(sX))$ for $s \in R^1$.
Consider the following examples.




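EXAMPLE 3.4.13 The Bernoulli Distribution
Let $X \sim \text{Bernoulli}(\theta)$. Then
\[
\begin{aligned}
c_X(s) = E(e^{isX}) &= (e^{is0})(1 - \theta) + (e^{is1})(\theta) \\
&= (1)(1 - \theta) + e^{is}(\theta) = 1 - \theta + \theta e^{is} \\
&= 1 - \theta + \theta \cos s + i\theta \sin s. \ ■
\end{aligned}
\]

EXAMPLE 3.4.14
Let X have probability function given by
\[
p_X(x) =
\begin{cases}
1/6 & x = 2 \\
1/3 & x = 3 \\
1/2 & x = 4 \\
0 & \text{otherwise.}
\end{cases}
\]
Then
\[
\begin{aligned}
c_X(s) = E(e^{isX}) &= (e^{is2})(1/6) + (e^{is3})(1/3) + (e^{is4})(1/2) \\
&= (1/6)\cos 2s + (1/3)\cos 3s + (1/2)\cos 4s \\
&\quad + (1/6)\, i \sin 2s + (1/3)\, i \sin 3s + (1/2)\, i \sin 4s. \ ■
\end{aligned}
\]

EXAMPLE 3.4.15
Let Z have probability function given by
\[
p_Z(z) =
\begin{cases}
1/2 & z = 1 \\
1/2 & z = -1 \\
0 & \text{otherwise.}
\end{cases}
\]
Then
\[
\begin{aligned}
c_Z(s) = E(e^{isZ}) &= (e^{is})(1/2) + (e^{-is})(1/2) \\
&= (1/2)\cos(s) + (1/2)\cos(-s) + (1/2)\, i \sin(s) + (1/2)\, i \sin(-s) \\
&= (1/2)\cos s + (1/2)\cos s + (1/2)\, i \sin s - (1/2)\, i \sin s = \cos s.
\end{aligned}
\]
Hence, in this case, $c_Z(s)$ is a real (not complex) number for all $s \in R^1$. ■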

Once we overcome our "fear" of imaginary and complex numbers, we can see
that the characteristic function is actually much better in some ways than the
moment-generating function. The main advantage is that, because $e^{isX} = \cos(sX) + i \sin(sX)$
and $|e^{isX}| = 1$, the characteristic function (unlike the moment-generating function) is
always finite (although it could be a complex number).


Theorem 3.4.8 Let X be any random variable, and let s be any real number. Then 
cx(s) is finite. 
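For a discrete distribution the characteristic function is a finite sum of complex exponentials, so it is straightforward to evaluate with complex arithmetic. The sketch below uses the probability functions of Examples 3.4.14 and 3.4.15; the function name and the dictionary representation are illustrative choices.

```python
import cmath
import math

def characteristic_function(pmf, s):
    """c_X(s) = E(e^{isX}) for a discrete distribution given as {value: probability}."""
    return sum(p * cmath.exp(1j * s * x) for x, p in pmf.items())

pmf_x = {2: 1 / 6, 3: 1 / 3, 4: 1 / 2}   # Example 3.4.14
pmf_z = {1: 1 / 2, -1: 1 / 2}            # Example 3.4.15

for s in (0.0, 0.7, 2.0, 10.0):
    c_x = characteristic_function(pmf_x, s)
    c_z = characteristic_function(pmf_z, s)
    assert abs(c_x) <= 1.0 + 1e-12          # bounded in modulus, hence finite
    assert abs(c_z - math.cos(s)) < 1e-12   # c_Z(s) = cos s, with zero imaginary part
    print(s, c_x, c_z)
```

The bound $|c_X(s)| \le 1$ follows from $|e^{isX}| = 1$, which is the reason the characteristic function is always finite.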


The characteristic function has many properties similar to the moment-generating 
function. In particular, we have the following. (The proof is just like the proof of 
Theorem 3.4.3.) 




Theorem 3.4.9 Let X be any random variable with its first k moments finite. Then
$c_X(0) = 1$, $c_X'(0) = iE(X)$, $c_X''(0) = i^2 E(X^2) = -E(X^2)$, etc. In general,
$c_X^{(k)}(0) = i^k E(X^k)$, where $i = \sqrt{-1}$, and where $c_X^{(k)}$ is the kth derivative of $c_X$.


We also have the following. (The proof is just like the proof of Theorem 3.4.5.) 


Theorem 3.4.10 Let X and Y be random variables which are independent. Then
$c_{X+Y}(s) = c_X(s)\, c_Y(s)$.


For simplicity, we shall generally not use characteristic functions in this book. 
However, it is worth keeping in mind that whenever we do anything with moment- 
generating functions, we could usually do the same thing in greater generality using 
characteristic functions. 


Summary of Section 3.4 


• The probability-generating function of a random variable X is $r_X(t) = E(t^X)$.

• If X is discrete, then the derivatives of $r_X$ satisfy $r_X^{(k)}(0) = k!\, P(X = k)$.

• The kth moment of a random variable X is $E(X^k)$.

• The moment-generating function of a random variable X is $m_X(s) = E(e^{sX}) = r_X(e^s)$.

• The derivatives of $m_X$ satisfy $m_X^{(k)}(0) = E(X^k)$, for k = 0, 1, 2, ....

• If X and Y are independent, then $r_{X+Y}(t) = r_X(t)\, r_Y(t)$ and
$m_{X+Y}(s) = m_X(s)\, m_Y(s)$.

• If $m_X(s)$ is finite in a neighborhood of s = 0, then it uniquely characterizes the
distribution of X.

• The characteristic function $c_X(s) = E(e^{isX})$ can be used in place of $m_X(s)$ to
avoid infinities.


EXERCISES 


3.4.1 Let Z be a discrete random variable with $P(Z = z) = 1/2^z$ for z = 1, 2, 3, ....
(a) Compute $r_Z(t)$. Verify that $r_Z'(0) = P(Z = 1)$ and $r_Z''(0) = 2\, P(Z = 2)$.
(b) Compute $m_Z(t)$. Verify that $m_Z'(0) = E(Z)$ and $m_Z''(0) = E(Z^2)$.

3.4.2 Let $X \sim \text{Binomial}(n, \theta)$. Use $m_X$ to prove that $\text{Var}(X) = n\theta(1 - \theta)$.

3.4.3 Let $Y \sim \text{Poisson}(\lambda)$. Use $m_Y$ to compute the mean and variance of Y.

3.4.4 Let Y = 3X + 4. Compute $r_Y(t)$ in terms of $r_X$.

3.4.5 Let Y = 3X + 4. Compute $m_Y(s)$ in terms of $m_X$.

3.4.6 Let $X \sim \text{Binomial}(n, \theta)$. Compute $E(X^3)$, the third moment of X.

3.4.8 Suppose P(X = 2) = 1/2, P(X = 5) = 1/3, and P(X = 7) = 1/6.
(a) Compute $r_X(t)$ for $t \in R^1$.
(b) Verify that $r_X'(0) = P(X = 1)$ and $r_X''(0) = 2\, P(X = 2)$.
(c) Compute $m_X(s)$ for $s \in R^1$.
(d) Verify that $m_X'(0) = E(X)$ and $m_X''(0) = E(X^2)$.


PROBLEMS 


3.4.9 Suppose $f_X(x) = 1/10$ for 0 < x < 10, with $f_X(x) = 0$ otherwise.
(a) Compute $m_X(s)$ for $s \in R^1$.

(b) Verify that $m_X'(0) = E(X)$. (Hint: L'Hôpital's rule.)

3.4.10 Let $X \sim \text{Geometric}(\theta)$. Compute $r_X(t)$ and $r_X''(0)/2$.

3.4.11 Let $X \sim \text{Negative-Binomial}(r, \theta)$. Compute $r_X(t)$ and $r_X''(0)/2$.

3.4.12 Let $X \sim \text{Geometric}(\theta)$.

(a) Compute $m_X(s)$.

(b) Use $m_X$ to compute the mean of X.

(c) Use $m_X$ to compute the variance of X.

3.4.13 Let $X \sim \text{Negative-Binomial}(r, \theta)$.

(a) Compute $m_X(s)$.

(b) Use $m_X$ to compute the mean of X.

(c) Use $m_X$ to compute the variance of X.

3.4.14 If Y = a + bX, where a and b are constants, then show that $r_Y(t) = t^a r_X(t^b)$
and $m_Y(t) = e^{at} m_X(bt)$.

3.4.15 Let $Z \sim N(\mu, \sigma^2)$. Show that
\[
m_Z(s) = \exp\left\{ \mu s + \frac{\sigma^2 s^2}{2} \right\}.
\]
(Hint: Write $Z = \mu + \sigma X$ where $X \sim N(0, 1)$, and use Theorem 3.4.4.)

3.4.16 Let Y be distributed according to the Laplace distribution (see Problem 2.4.22). 
(a) Compute my (s). (Hint: Break up the integral into two pieces.) 

(b) Use my to compute the mean of Y. 

(c) Use my to compute the variance of Y. 

3.4.17 Compute the kth moment of the Weibull($\alpha$) distribution in terms of $\Gamma$ (see
Problem 2.4.19).

3.4.18 Compute the kth moment of the Pareto($\alpha$) distribution (see Problem 2.4.20).
(Hint: Make the transformation $u = (1 + x)^{-1}$ and recall the beta distribution.)

3.4.19 Compute the kth moment of the Log-normal($\tau$) distribution (see Problem 2.6.17).
(Hint: Make the transformation $z = \ln x$ and use Problem 3.4.15.)

3.4.20 Prove that the moment-generating function of the Gamma($\alpha, \lambda$) distribution is
given by $\lambda^{\alpha}/(\lambda - t)^{\alpha}$ when $t < \lambda$.

3.4.21 Suppose that $X_i \sim \text{Poisson}(\lambda_i)$ and $X_1, \ldots, X_n$ are independent. Using
moment-generating functions, determine the distribution of $Y = \sum_{i=1}^{n} X_i$.

3.4.22 Suppose that $X_i \sim \text{Negative-Binomial}(r_i, \theta)$ and $X_1, \ldots, X_n$ are independent.
Using moment-generating functions, determine the distribution of $Y = \sum_{i=1}^{n} X_i$.

3.4.23 Suppose that $X_i \sim \text{Gamma}(\alpha_i, \lambda)$ and $X_1, \ldots, X_n$ are independent. Using
moment-generating functions, determine the distribution of $Y = \sum_{i=1}^{n} X_i$.




3.4.24 Suppose $X_1, X_2, \ldots$ is i.i.d. Exponential($\lambda$) and $N \sim \text{Poisson}(\lambda)$ independent
of the $\{X_i\}$. Determine the moment-generating function of $S_N$. Determine the first
moment of this distribution by differentiating this function.

3.4.25 Suppose $X_1, X_2, \ldots$ are i.i.d. Exponential($\lambda$) random variables and
$N \sim \text{Geometric}(\theta)$, independent of the $\{X_i\}$. Determine the moment-generating function
of $S_N$. Determine the first moment of this distribution by differentiating this function.

3.4.26 Let $X \sim \text{Bernoulli}(\theta)$. Use $c_X(s)$ to compute the mean of X.

3.4.27 Let $Y \sim \text{Binomial}(n, \theta)$.

(a) Compute the characteristic function $c_Y(s)$. (Hint: Make use of $c_X(s)$ in Problem
3.4.26.)

(b) Use $c_Y(s)$ to compute the mean of Y.

3.4.28 The characteristic function of the Cauchy distribution (see Problem 2.4.21) is
given by $c(t) = e^{-|t|}$. Use this to determine the characteristic function of the sample
mean
\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]
based on a sample of n from the Cauchy distribution. Explain why this implies that the
sample mean is also Cauchy distributed. What do you find surprising about this result?

3.4.29 The k-th cumulant (when it exists) of a random variable X is obtained by
calculating the k-th derivative of $\ln c_X(s)$ with respect to s, evaluating this at s = 0, and
dividing by $i^k$. Evaluate $c_X(s)$ and all the cumulants of the $N(\mu, \sigma^2)$ distribution.


3.5 | Conditional Expectation 


We have seen in Sections 1.5 and 2.8 that conditioning on some event, or some random 
variable, can change various probabilities. Now, because expectations are defined in 
terms of probabilities, it seems reasonable that expectations should also change when 
conditioning on some event or random variable. Such modified expectations are called 
conditional expectations, as we now discuss. 


3.5.1 | Discrete Case 


The simplest case is when X is a discrete random variable, and A is some event of 
positive probability. We have the following. 


Definition 3.5.1 Let X be a discrete random variable, and let A be some event with
P(A) > 0. Then the conditional expectation of X, given A, is equal to
\[
E(X \mid A) = \sum_{x \in R^1} x\, P(X = x \mid A) = \sum_{x \in R^1} x\, \frac{P(X = x, A)}{P(A)}.
\]


EXAMPLE 3.5.1
Consider rolling a fair six-sided die, so that S = {1, 2, 3, 4, 5, 6}. Let X be the number
showing, so that X(s) = s for $s \in S$. Let A = {3, 5, 6} be the event that the die shows
3, 5, or 6. What is E(X | A)?
Here we know that
\[
P(X = 3 \mid A) = P(X = 3 \mid X = 3, 5, \text{ or } 6) = 1/3
\]
and that, similarly, P(X = 5 | A) = P(X = 6 | A) = 1/3. Hence,
\[
\begin{aligned}
E(X \mid A) &= \sum_{x \in R^1} x\, P(X = x \mid A) \\
&= 3\, P(X = 3 \mid A) + 5\, P(X = 5 \mid A) + 6\, P(X = 6 \mid A) \\
&= 3(1/3) + 5(1/3) + 6(1/3) = 14/3. \ ■
\end{aligned}
\]
Often we wish to condition on the value of some other random variable. If the other 


random variable is also discrete, and if the conditioned value has positive probability, 
then this works as above. 


Definition 3.5.2 Let X and Y be discrete random variables, with P(Y = y) > 0.
Then the conditional expectation of X, given Y = y, is equal to
\[
E(X \mid Y = y) = \sum_{x \in R^1} x\, P(X = x \mid Y = y) = \sum_{x \in R^1} x\, \frac{p_{X,Y}(x, y)}{p_Y(y)}.
\]
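EXAMPLE 3.5.2
Suppose the joint probability function of X and Y is given by
\[
p_{X,Y}(x, y) =
\begin{cases}
1/7 & x = 5,\ y = 0 \\
1/7 & x = 5,\ y = 3 \\
1/7 & x = 5,\ y = 4 \\
3/7 & x = 8,\ y = 0 \\
1/7 & x = 8,\ y = 4 \\
0 & \text{otherwise.}
\end{cases}
\]
Then
\[
\begin{aligned}
E(X \mid Y = 0) &= \sum_{x \in R^1} x\, P(X = x \mid Y = 0) \\
&= 5\, P(X = 5 \mid Y = 0) + 8\, P(X = 8 \mid Y = 0) \\
&= 5\, \frac{P(X = 5,\ Y = 0)}{P(Y = 0)} + 8\, \frac{P(X = 8,\ Y = 0)}{P(Y = 0)} \\
&= 5\, \frac{1/7}{1/7 + 3/7} + 8\, \frac{3/7}{1/7 + 3/7} = \frac{29}{4}.
\end{aligned}
\]
Similarly,
\[
\begin{aligned}
E(X \mid Y = 4) &= \sum_{x \in R^1} x\, P(X = x \mid Y = 4) \\
&= 5\, P(X = 5 \mid Y = 4) + 8\, P(X = 8 \mid Y = 4) \\
&= 5\, \frac{1/7}{1/7 + 1/7} + 8\, \frac{1/7}{1/7 + 1/7} = 13/2.
\end{aligned}
\]
Also,
\[
E(X \mid Y = 3) = \sum_{x \in R^1} x\, P(X = x \mid Y = 3) = 5\, P(X = 5 \mid Y = 3) = 5\, \frac{1/7}{1/7} = 5. \ ■
\]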

Sometimes we wish to condition on a random variable Y, without specifying in ad- 
vance on what value of Y we are conditioning. In this case, the conditional expectation 
E(X | Y) is itself a random variable — namely, it depends on the (random) value of Y 
that occurs. 


Definition 3.5.3 Let X and Y be discrete random variables. Then the conditional 
expectation of X, given Y, is the random variable E(X |Y), which is equal to 


E(X|Y = y) when Y = y. In particular, E(X |Y) is a random variable that 
depends on the random value of Y. 


EXAMPLE 3.5.3
Suppose again that the joint probability function of X and Y is given by
\[
p_{X,Y}(x, y) =
\begin{cases}
1/7 & x = 5,\ y = 0 \\
1/7 & x = 5,\ y = 3 \\
1/7 & x = 5,\ y = 4 \\
3/7 & x = 8,\ y = 0 \\
1/7 & x = 8,\ y = 4 \\
0 & \text{otherwise.}
\end{cases}
\]
We have already computed that E(X | Y = 0) = 29/4, E(X | Y = 4) = 13/2, and
E(X | Y = 3) = 5. We can express these results together by saying that
\[
E(X \mid Y) =
\begin{cases}
29/4 & Y = 0 \\
5 & Y = 3 \\
13/2 & Y = 4.
\end{cases}
\]
That is, E(X | Y) is a random variable, which depends on the value of Y. Note that,
because P(Y = y) = 0 for $y \ne 0, 3, 4$, the random variable E(X | Y) is undefined in
that case; but this is not a problem because that case will never occur. ■
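The computations in Examples 3.5.2 and 3.5.3 follow directly from Definition 3.5.2 and can be sketched in a few lines using exact fractions. The dictionary representation of the joint probability function and the function name are illustrative choices.

```python
from fractions import Fraction

# Joint probability function from Example 3.5.2, stored as {(x, y): probability}.
joint = {
    (5, 0): Fraction(1, 7), (5, 3): Fraction(1, 7), (5, 4): Fraction(1, 7),
    (8, 0): Fraction(3, 7), (8, 4): Fraction(1, 7),
}

def conditional_expectation(joint, y):
    """E(X | Y = y) = sum over x of x * p(x, y) / p_Y(y); requires P(Y = y) > 0."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    if p_y == 0:
        raise ValueError("E(X | Y = y) is undefined when P(Y = y) = 0")
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y

# E(X | Y) viewed as a function of the value taken by Y.
table = {y: conditional_expectation(joint, y) for y in sorted({y for _, y in joint})}
print(table)
```

The explicit error for P(Y = y) = 0 mirrors the remark above that E(X | Y) is undefined for values of y that never occur.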


Finally, we note that just like for regular expectation, conditional expectation is 
linear. 


Theorem 3.5.1 Let $X_1$, $X_2$, and Y be random variables; let A be an event; let a, b,
and y be real numbers; and let $Z = aX_1 + bX_2$. Then
(a) $E(Z \mid A) = aE(X_1 \mid A) + bE(X_2 \mid A)$.
(b) $E(Z \mid Y = y) = aE(X_1 \mid Y = y) + bE(X_2 \mid Y = y)$.
(c) $E(Z \mid Y) = aE(X_1 \mid Y) + bE(X_2 \mid Y)$.




3.5.2 | Absolutely Continuous Case 


Suppose now that X and Y are jointly absolutely continuous. Then conditioning on
Y = y, for some particular value of y, seems problematic, because P(Y = y) = 0.
However, we have already seen in Section 2.8.2 that we can define a conditional density
$f_{X|Y}(x \mid y)$ that gives us a density function for X, conditional on Y = y. And because
density functions give rise to expectations, similarly conditional density functions give
rise to conditional expectations, as follows.


Definition 3.5.4 Let X and Y be jointly absolutely continuous random variables,
with joint density function $f_{X,Y}(x, y)$. Then the conditional expectation of X, given
Y = y, is equal to
\[
E(X \mid Y = y) = \int_{x \in R^1} x\, f_{X|Y}(x \mid y)\, dx = \int_{x \in R^1} x\, \frac{f_{X,Y}(x, y)}{f_Y(y)}\, dx.
\]
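EXAMPLE 3.5.4
Let X and Y be jointly absolutely continuous, with joint density function $f_{X,Y}$ given
by
\[
f_{X,Y}(x, y) =
\begin{cases}
4x^2 y + 2y^5 & 0 \le x \le 1,\ 0 \le y \le 1 \\
0 & \text{otherwise.}
\end{cases}
\]
Then for 0 < y < 1,
\[
f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx = \int_0^1 (4x^2 y + 2y^5)\, dx = 4y/3 + 2y^5.
\]
Hence,
\[
\begin{aligned}
E(X \mid Y = y) &= \int_{x \in R^1} x\, \frac{f_{X,Y}(x, y)}{f_Y(y)}\, dx = \int_0^1 x\, \frac{4x^2 y + 2y^5}{4y/3 + 2y^5}\, dx \\
&= \frac{y + y^5}{4y/3 + 2y^5} = \frac{1 + y^4}{4/3 + 2y^4}. \ ■
\end{aligned}
\]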
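The closed form for E(X | Y = y) can be compared with a direct numerical evaluation of the two integrals in Definition 3.5.4. The following is a minimal sketch using a midpoint rule; the helper names and the number of steps are assumptions made for illustration.

```python
def joint_density(x, y):
    """Joint density from Example 3.5.4, supported on the unit square."""
    return 4 * x * x * y + 2 * y ** 5 if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def integrate(g, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (j + 0.5) * h) for j in range(steps))

def conditional_mean(y):
    """E(X | Y = y): integral of x f(x, y) dx divided by the marginal f_Y(y)."""
    f_y = integrate(lambda x: joint_density(x, y), 0.0, 1.0)
    return integrate(lambda x: x * joint_density(x, y), 0.0, 1.0) / f_y

for y in (0.1, 0.5, 0.9):
    # Numerical value next to the closed form (1 + y^4) / (4/3 + 2 y^4).
    print(y, conditional_mean(y), (1 + y ** 4) / (4 / 3 + 2 * y ** 4))
```

The two columns should agree to many decimal places for any y strictly between 0 and 1.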
As in the discrete case, we often wish to condition on a random variable without 


specifying in advance the value of that variable. Thus, E(X | Y) is again a random 
variable, depending on the random value of Y. 


Definition 3.5.5 Let X and Y be jointly absolutely continuous random variables. 
Then the conditional expectation of X, given Y, is the random variable E(X | Y), 


which is equal to E(X | Y = y) when Y = y. Thus, E(X | Y) is a random variable 
that depends on the random value of Y. 


EXAMPLE 3.5.5
Let X and Y again have joint density
\[
f_{X,Y}(x, y) =
\begin{cases}
4x^2 y + 2y^5 & 0 \le x \le 1,\ 0 \le y \le 1 \\
0 & \text{otherwise.}
\end{cases}
\]
We already know that $E(X \mid Y = y) = (1 + y^4)/(4/3 + 2y^4)$. This formula is valid
for any y between 0 and 1, so we conclude that $E(X \mid Y) = (1 + Y^4)/(4/3 + 2Y^4)$.
Note that in this last formula, Y is a random variable, so E(X | Y) is also a random
variable. ■


Finally, we note that in the absolutely continuous case, conditional expectation is 
still linear, i.e., Theorem 3.5.1 continues to hold. 


3.5.3 Double Expectations 


Because the conditional expectation E(X | Y) is itself a random variable (as a function 
of Y), it makes sense to take its expectation, E(E(X | Y)). This is a double expectation. 
One of the key results about conditional expectation is that this double expectation is 
always equal to E(X). 


Theorem 3.5.2 (Theorem of total expectation) If X and Y are random variables, 
then E (E(X | Y)) = E(X). 


This theorem follows as a special case of Theorem 3.5.3 below. But it 
also makes sense intuitively. Indeed, conditioning on Y will change the conditional 
value of X in various ways, sometimes making it smaller and sometimes larger, 
depending on the value of Y. However, if we then average over all possible values of Y, 
these various effects will cancel out, and we will be left with just E(X). 


EXAMPLE 3.5.6 
Suppose again that X and Y have joint probability function 

$$p_{X,Y}(x, y) = \begin{cases} 1/7 & x = 5,\ y = 0 \\ 1/7 & x = 5,\ y = 3 \\ 1/7 & x = 5,\ y = 4 \\ 3/7 & x = 8,\ y = 0 \\ 1/7 & x = 8,\ y = 4 \\ 0 & \text{otherwise.} \end{cases}$$

Then we know that 

$$E(X \mid Y = y) = \begin{cases} 29/4 & y = 0 \\ 5 & y = 3 \\ 13/2 & y = 4. \end{cases}$$

Also, P(Y = 0) = 1/7 + 3/7 = 4/7, P(Y = 3) = 1/7, and P(Y = 4) 
= 1/7 + 1/7 = 2/7. Hence, 

$$\begin{aligned} E(E(X \mid Y)) &= \sum_{y \in R^1} E(X \mid Y = y) \, P(Y = y) \\ &= E(X \mid Y = 0) P(Y = 0) + E(X \mid Y = 3) P(Y = 3) + E(X \mid Y = 4) P(Y = 4) \\ &= (29/4)(4/7) + (5)(1/7) + (13/2)(2/7) = 47/7. \end{aligned}$$

On the other hand, we compute directly that E(X) = 5P(X = 5) + 8P(X = 8) = 
5(3/7) + 8(4/7) = 47/7. Hence, E(E(X | Y)) = E(X), as claimed. ∎ 
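The arithmetic of Example 3.5.6 can be reproduced with exact fractions. This is a small sketch rather than anything prescribed by the text: the dictionary layout and helper names are assumptions, and the entry 3/7 at (x, y) = (8, 0) is the value implied by P(Y = 0) = 1/7 + 3/7.

```python
from fractions import Fraction

# Joint probability function of Example 3.5.6, keyed by (x, y).
joint_pmf = {
    (5, 0): Fraction(1, 7),
    (5, 3): Fraction(1, 7),
    (5, 4): Fraction(1, 7),
    (8, 0): Fraction(3, 7),
    (8, 4): Fraction(1, 7),
}


def marginal_y(pmf):
    """P(Y = y) for every y with positive probability."""
    out = {}
    for (_, y), p in pmf.items():
        out[y] = out.get(y, Fraction(0)) + p
    return out


def conditional_means(pmf):
    """E(X | Y = y) for every y with positive probability."""
    p_y = marginal_y(pmf)
    return {
        y: sum(x * p for (x, yy), p in pmf.items() if yy == y) / p_y[y]
        for y in p_y
    }


p_y = marginal_y(joint_pmf)
cond = conditional_means(joint_pmf)

# Average the conditional means over the distribution of Y.
double_expectation = sum(cond[y] * p_y[y] for y in p_y)
# Compute E(X) directly from the joint probability function.
direct_expectation = sum(x * p for (x, _), p in joint_pmf.items())

print(cond, double_expectation, direct_expectation)
```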


EXAMPLE 3.5.7 
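Let X and Y again have joint density 

$$f_{X,Y}(x, y) = \begin{cases} 4x^2 y + 2y^5 & 0 < x < 1,\ 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

We already know that 

$$E(X \mid Y) = \frac{1 + Y^4}{4/3 + 2Y^4}$$

and that f_Y(y) = 4y/3 + 2y^5 for 0 < y < 1. Hence, 

$$\begin{aligned} E(E(X \mid Y)) &= E\left( \frac{1 + Y^4}{4/3 + 2Y^4} \right) = \int_{-\infty}^{\infty} E(X \mid Y = y) \, f_Y(y) \, dy \\ &= \int_0^1 \frac{1 + y^4}{4/3 + 2y^4} \, (4y/3 + 2y^5) \, dy = \int_0^1 (y + y^5) \, dy = 1/2 + 1/6 = 2/3. \end{aligned}$$

On the other hand, 

$$\begin{aligned} E(X) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x \, f_{X,Y}(x, y) \, dy \, dx = \int_0^1 \int_0^1 x \, (4x^2 y + 2y^5) \, dy \, dx \\ &= \int_0^1 x \, (2x^2 + 1/3) \, dx = \int_0^1 (2x^3 + x/3) \, dx = 2/4 + 1/6 = 2/3. \end{aligned}$$

Hence, E(E(X | Y)) = E(X), as claimed. ∎ 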


Theorem 3.5.2 is a special case (with g(y) = 1) of the following more general 
result, which in fact characterizes conditional expectation. 


Theorem 3.5.3 Let X and Y be random variables, and let g : R^1 → R^1 be any 
function. Then E(g(Y)E(X | Y)) = E(g(Y)X). 


PROOF See Section 3.8 for the proof of this result. 
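Theorem 3.5.3 can be illustrated on the discrete distribution of Example 3.5.6. The sketch below is only a check of one instance: the choice g(y) = y^2 is hypothetical, picked for illustration, and the helper names are not from the text.

```python
from fractions import Fraction

# Joint probability function of Example 3.5.6, keyed by (x, y).
joint_pmf = {
    (5, 0): Fraction(1, 7),
    (5, 3): Fraction(1, 7),
    (5, 4): Fraction(1, 7),
    (8, 0): Fraction(3, 7),
    (8, 4): Fraction(1, 7),
}


def g(y):
    """Hypothetical choice of g, used only to illustrate the theorem."""
    return y * y


def marginal_y(pmf):
    """P(Y = y) for every y with positive probability."""
    out = {}
    for (_, y), p in pmf.items():
        out[y] = out.get(y, Fraction(0)) + p
    return out


def conditional_mean_x(pmf, y):
    """E(X | Y = y) computed from the joint probability function."""
    p_y = marginal_y(pmf)[y]
    return sum(x * p for (x, yy), p in pmf.items() if yy == y) / p_y


# Left side: E(g(Y) E(X | Y)), a sum over the values of Y only.
lhs = sum(
    g(y) * conditional_mean_x(joint_pmf, y) * p
    for y, p in marginal_y(joint_pmf).items()
)
# Right side: E(g(Y) X), a sum over all pairs (x, y).
rhs = sum(g(y) * x * p for (x, y), p in joint_pmf.items())

print(lhs, rhs)
```

Taking g to be the constant function 1 instead recovers the check E(E(X | Y)) = E(X) of Theorem 3.5.2.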


We also note the following related result. It says that, when conditioning on Y, any 
function of Y can be factored out since it is effectively a constant. 


Theorem 3.5.4 Let X and Y be random variables, and let g : R^1 → R^1 be any 
function. Then E(g(Y)X | Y) = g(Y)E(X | Y). 


PROOF See Section 3.8 for the proof of this result. 


Finally, because conditioning twice on Y is the same as conditioning just once on 
Y, we immediately have the following. 


Theorem 3.5.5 Let X and Y be random variables. Then E (E(X | Y)|Y) = E(X | Y). 




3.5.4 Conditional Variance (Advanced) 


In addition to defining conditional expectation, we can define conditional variance. As 
usual, this involves the expected squared distance of a random variable to its mean. 
However, in this case, the expectation is a conditional expectation. In addition, the 
mean is a conditional mean. 


Definition 3.5.6 If X is a random variable, and A is an event with P(A) > 0, then 
the conditional variance of X, given A, is equal to 


$$\mathrm{Var}(X \mid A) = E((X - E(X \mid A))^2 \mid A) = E(X^2 \mid A) - (E(X \mid A))^2.$$

Similarly, if Y is another random variable, then 

$$\mathrm{Var}(X \mid Y = y) = E((X - E(X \mid Y = y))^2 \mid Y = y) = E(X^2 \mid Y = y) - (E(X \mid Y = y))^2,$$

and Var(X | Y) = E((X − E(X | Y))^2 | Y) = E(X^2 | Y) − (E(X | Y))^2. 


EXAMPLE 3.5.8 
Consider again rolling a fair six-sided die, so that S = {1, 2, 3, 4, 5, 6}, with P(s) = 
1/6 and X(s) = s for s ∈ S, and with A = {3, 5, 6}. We have already computed that 
P(X = s | A) = 1/3 for s ∈ A, and that E(X | A) = 14/3. Hence, 

$$\begin{aligned} \mathrm{Var}(X \mid A) &= E((X - E(X \mid A))^2 \mid A) = E((X - 14/3)^2 \mid A) = \sum_{s \in S} (s - 14/3)^2 \, P(X = s \mid A) \\ &= (3 - 14/3)^2 (1/3) + (5 - 14/3)^2 (1/3) + (6 - 14/3)^2 (1/3) = 14/9 \approx 1.56. \end{aligned}$$

By contrast, because E(X) = 7/2, we have 

$$\mathrm{Var}(X) = E((X - E(X))^2) = \sum_{x=1}^{6} (x - 7/2)^2 (1/6) = 35/12 \approx 2.92.$$

Hence, we see that the conditional variance Var(X | A) is much smaller than the 
unconditional variance Var(X). This indicates that, in this example, once we know that event 
A has occurred, we know more about the value of X than we did originally. ∎ 
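The two variances in Example 3.5.8 can be recomputed exactly. This is an illustrative sketch; the helper names `mean`, `variance`, and `condition_on` are assumptions introduced here and do not appear in the text.

```python
from fractions import Fraction


def mean(dist):
    """Expected value of a finite distribution given as {value: probability}."""
    return sum(v * p for v, p in dist.items())


def variance(dist):
    """Expected squared distance of the value from its mean."""
    m = mean(dist)
    return sum((v - m) ** 2 * p for v, p in dist.items())


def condition_on(dist, event):
    """Distribution of X given the event {X in event}, assuming P(event) > 0."""
    total = sum(p for v, p in dist.items() if v in event)
    return {v: p / total for v, p in dist.items() if v in event}


die = {s: Fraction(1, 6) for s in range(1, 7)}
given_a = condition_on(die, {3, 5, 6})

print(mean(given_a), variance(given_a), variance(die))
```

Conditioning here simply restricts the distribution to A and renormalizes, after which the ordinary mean and variance formulas apply.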


EXAMPLE 3.5.9 
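Suppose X and Y have joint density function 

$$f_{X,Y}(x, y) = \begin{cases} 8xy & 0 < x < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

We have f_Y(y) = 4y^3, f_{X|Y}(x | y) = 8xy/4y^3 = 2x/y^2 for 0 < x < y, and so 

$$E(X \mid Y = y) = \int_0^y x \, \frac{2x}{y^2} \, dx = \int_0^y \frac{2x^2}{y^2} \, dx = \frac{2y^3}{3y^2} = \frac{2y}{3}.$$

Therefore, 

$$\begin{aligned} \mathrm{Var}(X \mid Y = y) &= E((X - E(X \mid Y = y))^2 \mid Y = y) \\ &= \int_0^y \left( x - \frac{2y}{3} \right)^2 \frac{2x}{y^2} \, dx = \frac{1}{2} y^2 - \frac{8}{9} y^2 + \frac{4}{9} y^2 = \frac{y^2}{18}. \end{aligned}$$
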


Finally, we note that conditional expectation and conditional variance satisfy the 
following useful identity. 


Theorem 3.5.6 For random variables X and Y, 


Var(X) = Var (E(X | Y)) + E (Var(X | Y)). 

PROOF See Section 3.8 for the proof of this result. 
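The identity of Theorem 3.5.6 can be checked on the discrete distribution of Example 3.5.6. The following sketch uses exact fractions; the helper names are illustrative assumptions, and the check covers only this one distribution.

```python
from fractions import Fraction

# Joint probability function of Example 3.5.6, keyed by (x, y).
joint_pmf = {
    (5, 0): Fraction(1, 7),
    (5, 3): Fraction(1, 7),
    (5, 4): Fraction(1, 7),
    (8, 0): Fraction(3, 7),
    (8, 4): Fraction(1, 7),
}


def mean(dist):
    return sum(v * p for v, p in dist.items())


def variance(dist):
    m = mean(dist)
    return sum((v - m) ** 2 * p for v, p in dist.items())


def marginal(pmf, index):
    """Marginal of coordinate `index` (0 for X, 1 for Y)."""
    out = {}
    for pair, p in pmf.items():
        out[pair[index]] = out.get(pair[index], Fraction(0)) + p
    return out


def conditional_x_given_y(pmf, y):
    """Conditional distribution of X given Y = y."""
    p_y = marginal(pmf, 1)[y]
    return {x: p / p_y for (x, yy), p in pmf.items() if yy == y}


p_y = marginal(joint_pmf, 1)
cond_means = {y: mean(conditional_x_given_y(joint_pmf, y)) for y in p_y}
cond_vars = {y: variance(conditional_x_given_y(joint_pmf, y)) for y in p_y}

var_x = variance(marginal(joint_pmf, 0))
overall_mean = sum(cond_means[y] * p_y[y] for y in p_y)
# Var(E(X | Y)): variance of the conditional mean as a function of Y.
var_of_cond_mean = sum((cond_means[y] - overall_mean) ** 2 * p_y[y] for y in p_y)
# E(Var(X | Y)): average of the conditional variances.
mean_of_cond_var = sum(cond_vars[y] * p_y[y] for y in p_y)

print(var_x, var_of_cond_mean, mean_of_cond_var)
```

The first term measures how much of the variability of X is explained by Y, and the second measures what remains after Y is known.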


Summary of Section 3.5 


• If X is discrete, then the conditional expectation of X, given an event A, is equal 
to E(X | A) = Σ_{x ∈ R^1} x P(X = x | A). 

• If X and Y are discrete random variables, then E(X | Y) is itself a random 
variable, with E(X | Y) equal to E(X | Y = y) when Y = y. 

• If X and Y are jointly absolutely continuous, then E(X | Y) is itself a random 
variable, with E(X | Y) equal to E(X | Y = y) when Y = y, where E(X | Y = y) 
= ∫ x f_{X|Y}(x | y) dx. 

• Conditional expectation is linear. 

• We always have that E(g(Y) E(X | Y)) = E(g(Y) X), and E(E(X | Y) | Y) = 
E(X | Y). 

• Conditional variance is given by Var(X | Y) = E(X^2 | Y) − (E(X | Y))^2. 


EXERCISES 


3.5.1 Suppose X and Y are discrete, with 


$$p_{X,Y}(x, y) = \begin{cases} 1/5 & x = 2,\ y = 3 \\ 1/5 & x = 3,\ y = 2 \\ 1/5 & x = 3,\ y = 3 \\ 1/5 & x = 2,\ y = 2 \\ 1/5 & x = 3,\ y = 17 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Compute E(X | Y = 3). 
(b) Compute E(Y | X = 3). 
(c) Compute E(X | Y). 
(d) Compute E(Y | X). 




3.5.2 Suppose X and Y are jointly absolutely continuous, with 


fe eee 0<x<4,0<y<5 
0 


Sx,y@,y) = otherwise. 


(a) Compute f_X(x). 

(b) Compute f_Y(y). 

(c) Compute E(X | Y). 

(d) Compute E (Y | X). 

(e) Compute E (E(X | Y)), and verify that it is equal to E (X). 
3.5.3 Suppose X and Y are discrete, with 


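$$p_{X,Y}(x, y) = \begin{cases} 1/11 & x = -4,\ y = 2 \\ 2/11 & x = -4,\ y = 3 \\ 4/11 & x = -4,\ y = 7 \\ 1/11 & x = 6,\ y = 2 \\ 1/11 & x = 6,\ y = 3 \\ 1/11 & x = 6,\ y = 7 \\ 1/11 & x = 6,\ y = 13 \\ 0 & \text{otherwise.} \end{cases}$$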


(a) Compute E (Y | X = 6). 

(b) Compute E (Y | X = —4). 

(c) Compute E (Y | X). 

3.5.4 Let p_{X,Y} be as in the previous exercise. 

(a) Compute E(X |Y = 2). 

(b) Compute E(X |Y = 3). 

(c) Compute E(X |Y = 7). 

(d) Compute E(X | Y = 13). 

(e) Compute E(X | Y). 

3.5.5 Suppose that a student must choose one of two summer job offers. If it is not nec- 
essary to take a summer course, then a job as a waiter will produce earnings (rounded 
to the nearest $1000) with the following probability distribution. 


$1000 $2000 $3000 $4000 


0.1 0.3 0.4 0.2 


If it is necessary to take a summer course, then a part-time job at a hotel will produce 
earnings (rounded to the nearest $1000) with the following probability distribution. 


$1000 $2000 $3000 $4000 


0.3 0.4 0.2 0.1 


If the probability that the student will have to take the summer course is 0.6, then 
determine the student’s expected summer earnings. 

3.5.6 Suppose you roll two fair six-sided dice. Let X be the number showing on the 
first die, and let Z be the sum of the two numbers showing. 




(a) Compute E(X). 

(b) Compute E(Z | X = 1). 

(c) Compute E(Z |X = 6). 

(d) Compute E(X | Z = 2). 

(e) Compute E(X |Z = 4). 

(f) Compute E(X | Z = 6). 

(g) Compute E(X | Z = 7). 

(h) Compute E(X | Z = 11). 

3.5.7 Suppose you roll two fair six-sided dice. Let Z be the sum of the two numbers 
showing, and let W be the product of the two numbers showing. 

(a) Compute E(Z| W = 4). 

(b) Compute E(W | Z = 4). 

3.5.8 Suppose you roll one fair six-sided die and then flip as many coins as the number 
showing on the die. (For example, if the die shows 4, then you flip four coins.) Let X 
be the number showing on the die, and Y be the number of heads obtained. 

(a) Compute E(Y | X = 5). 

(b) Compute E(X | Y = 0). 

(c) Compute E(X | Y = 2). 

3.5.9 Suppose you flip three fair coins. Let X be the number of heads obtained, and 
let Y = 1 if the first coin shows heads, otherwise Y = 0. 

(a) Compute E(X | Y = 0). 

(b) Compute E(X | Y = 1). 

(c) Compute E(Y | X = 0). 

(d) Compute E(Y | X = 1). 

(e) Compute E(Y | X = 2). 

(f) Compute E(Y | X = 3). 

(g) Compute E (Y | X). 

(h) Verify directly that E[E(Y | X)] = E(Y). 

3.5.10 Suppose you flip one fair coin and roll one fair six-sided die. Let X be the 
number showing on the die, and let Y = 1 if the coin is heads with Y = 0 if the coin is 
tails. Let Z = XY. 

(a) Compute E (Z). 

(b) Compute E (Z |X = 4). 

(c) Compute E (Y | X = 4). 

(d) Compute E (Y | Z = 4). 

(e) Compute E(X | Z = 4). 

3.5.11 Suppose X and Y are jointly absolutely continuous, with joint density function 
f_{X,Y}(x, y) = (6/19)(x^2 + y^3) for 0 < x < 2 and 0 < y < 1, otherwise f_{X,Y}(x, y) = 0. 


(a) Compute E(X). 

(b) Compute E (Y). 

(c) Compute E(X |Y). 

(d) Compute E (Y | X). 

(e) Verify directly that E[E(X | Y)] = E(X). 
(f) Verify directly that E[E(Y | X)] = E(Y). 




PROBLEMS 


3.5.12 Suppose there are two urns. Urn 1 contains 100 chips: 30 are labelled 1, 40 
are labelled 2, and 30 are labelled 3. Urn 2 contains 100 chips: 20 are labelled 1, 
50 are labelled 2, and 30 are labelled 3. A coin is tossed and if a head is observed, 
then a chip is randomly drawn from urn 1, otherwise a chip is randomly drawn from 
urn 2. The value Y on the chip is recorded. If an occurrence of a head on the coin 
is denoted by X = 1, a tail by X = 0, and X ~ Bernoulli(3/4), then determine 
E(X | Y), E(Y | X), E(Y), and E(X). 

3.5.13 Suppose that five coins are each tossed until the first head is obtained on each 
coin and where each coin has probability θ of producing a head. If you are told that the 
total number of tails observed is Y = 10, then determine the expected number of tails 
observed on the first coin. 

3.5.14 (Simpson’s paradox) Suppose that the conditional distributions of Y, given X, 
are shown in the following table. For example, p_{Y|X}(1 | i) could correspond to the 
probability that a randomly selected heart patient at hospital i has a successful 
treatment. 


p_{Y|X}(0 | 1)    p_{Y|X}(1 | 1) 
0.030             0.970 


p_{Y|X}(0 | 2)    p_{Y|X}(1 | 2) 
0.020             0.980 


(a) Compute E(Y | X). 

(b) Now suppose that patients are additionally classified as being seriously ill (Z = 1), 
or not seriously ill (Z = 0). The conditional distributions of Y, given (X, Z), are 
shown in the following tables. Compute E(Y | X, Z). 


p_{Y|X,Z}(0 | 1, 0)    p_{Y|X,Z}(1 | 1, 0) 
0.010                  0.990 

p_{Y|X,Z}(0 | 2, 0)    p_{Y|X,Z}(1 | 2, 0) 
0.013                  0.987 


p_{Y|X,Z}(0 | 1, 1)    p_{Y|X,Z}(1 | 1, 1) 
0.038                  0.962 


p_{Y|X,Z}(0 | 2, 1)    p_{Y|X,Z}(1 | 2, 1) 
0.040                  0.960 


(c) Explain why the conditional distributions in part (a) indicate that hospital 2 is the 
better hospital for a patient who needs to undergo this treatment, but all the conditional 
distributions in part (b) indicate that hospital 1 is the better hospital. This phenomenon 
is known as Simpson’s paradox. 

(d) Prove that, in general, p_{Y|X}(y | x) = Σ_z p_{Y|X,Z}(y | x, z) p_{Z|X}(z | x) and E(Y | X) 
= E(E(Y | X, Z) | X). 

(e) If the conditional distributions p_{Z|X}(· | x), corresponding to the example discussed 
in parts (a) through (c), are given in the following table, verify the result in part (d) 
numerically and explain how this resolves Simpson’s paradox. 


p_{Z|X}(0 | 1)    p_{Z|X}(1 | 1) 
0.286             0.714 


p_{Z|X}(0 | 2)    p_{Z|X}(1 | 2) 
0.750             0.250 


3.5.15 Present an example of a random variable X, and an event A with P(A) > 0, 
such that Var(X | A) > Var(X). (Hint: Suppose S = {1, 2, 3} with X(s) = s, and 
A = {1, 3}.) 

3.5.16 Suppose that X, given Y = y, is distributed Gamma(α, y) and that the marginal 
distribution of Y is given by 1/Y ~ Exponential(λ). Determine E(X). 

3.5.17 Suppose that (X, Y) ~ Bivariate Normal(μ_1, μ_2, σ_1, σ_2, ρ). Use (2.7.1) (when 
given Y = y) and its analog (when given X = x) to determine E(X | Y), E(Y | X), 
Var(X | Y), and Var(Y | X). 

3.5.18 Suppose that (X_1, X_2, X_3) ~ Multinomial(n, θ_1, θ_2, θ_3). Determine E(X_1 | X_2) 
and Var(X_1 | X_2). (Hint: Show that X_1, given X_2 = x_2, has a binomial distribution.) 

3.5.19 Suppose that (X_1, X_2) ~ Dirichlet(α_1, α_2, α_3). Determine E(X_1 | X_2) and 
Var(X_1 | X_2). (Hint: First show that X_1/(1 − x_2), given X_2 = x_2, has a beta 
distribution and then use Problem 3.3.24.) 

3.5.20 Let f_{X,Y} be as in Exercise 3.5.2. 

(a) Compute Var(X). 

(b) Compute Var(E(X | Y)). 

(c) Compute Var(X | Y). 

(d) Verify that Var(X) = Var(E (X | Y)) + E (Var(X | Y)). 

3.5.21 Suppose we have three discrete random variables X, Y, and Z. We say that X 
and Y are conditionally independent, given Z, if 


$$p_{X,Y|Z}(x, y \mid z) = p_{X|Z}(x \mid z) \, p_{Y|Z}(y \mid z)$$


for every x, y, and z such that P (Z = z) > 0. Prove that when X and Y are condition- 
ally independent, given Z, then 


$$E(g(X) h(Y) \mid Z) = E(g(X) \mid Z) \, E(h(Y) \mid Z).$$


3.6 Inequalities 


Expectation and variance are closely related to the underlying distributions of random 
variables. This relationship allows us to prove certain inequalities that are often very 
useful. We begin with a classic result, Markov’s inequality, which is very simple but 
also very useful and powerful. 




Theorem 3.6.1 (Markov’s inequality) If X is a nonnegative random variable, then 
for all a > 0, 

$$P(X \ge a) \le \frac{E(X)}{a}.$$

That is, the probability that X exceeds any given value a is no more than the mean 
of X divided by a. 


PROOF Define a new random variable Z by 

$$Z = \begin{cases} a & X \ge a \\ 0 & X < a. \end{cases}$$

Then clearly Z ≤ X, so that E(Z) ≤ E(X) by monotonicity. On the other hand, 

$$E(Z) = a \, P(Z = a) + 0 \cdot P(Z = 0) = a \, P(Z = a) = a \, P(X \ge a).$$

So, E(X) ≥ E(Z) = a P(X ≥ a). Rearranging, P(X ≥ a) ≤ E(X)/a, as claimed. ∎ 


Intuitively, Markov’s inequality says that if the expected value of X is small, then 
it is unlikely that X will be too large. We now consider some applications of Theorem 
3.6.1. 


EXAMPLE 3.6.1 

Suppose P(X = 3) = 1/2, P(X = 4) = 1/3, and P(X = 7) = 1/6. Then E(X) = 
3(1/2) + 4(1/3) + 7(1/6) = 4. Hence, setting a = 6, Markov’s inequality says that 
P(X ≥ 6) ≤ 4/6 = 2/3. In fact, P(X ≥ 6) = 1/6 < 2/3. ∎ 


EXAMPLE 3.6.2 

Suppose P(X = 2) = P(X = 8) = 1/2. Then E(X) = 2(1/2) + 8(1/2) = 5. 
Hence, setting a = 8, Markov’s inequality says that P(X ≥ 8) ≤ 5/8. In fact, 
P(X ≥ 8) = 1/2 < 5/8. ∎ 


EXAMPLE 3.6.3 

Suppose P(X = 0) = P(X = 2) = 1/2. Then E(X) = 0(1/2) + 2(1/2) = 1. 
Hence, setting a = 2, Markov’s inequality says that P(X ≥ 2) ≤ 1/2. In fact, 
P(X ≥ 2) = 1/2, so Markov’s inequality is an equality in this case. ∎ 
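The three Markov examples above can be tabulated side by side. This is a small sketch with exact fractions; the function name `markov_check` and the dictionary of examples are assumptions made for illustration.

```python
from fractions import Fraction


def markov_check(pmf, a):
    """Return (P(X >= a), E(X)/a) for a nonnegative discrete X and a > 0."""
    mean = sum(x * p for x, p in pmf.items())
    tail = sum(p for x, p in pmf.items() if x >= a)
    return tail, mean / a


examples = {
    "3.6.1": ({3: Fraction(1, 2), 4: Fraction(1, 3), 7: Fraction(1, 6)}, 6),
    "3.6.2": ({2: Fraction(1, 2), 8: Fraction(1, 2)}, 8),
    "3.6.3": ({0: Fraction(1, 2), 2: Fraction(1, 2)}, 2),
}

results = {name: markov_check(pmf, a) for name, (pmf, a) in examples.items()}
for name, (tail, bound) in results.items():
    print(name, "exact tail:", tail, "Markov bound:", bound)
```

In each case the exact tail probability is at most the bound, with equality in the last example.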


Markov’s inequality is also used to prove Chebychev’s inequality, perhaps the most 
important inequality in all of probability theory. 


Theorem 3.6.2 (Chebychev’s inequality) Let Y be an arbitrary random variable, 
with finite mean μ_Y. Then for all a > 0, 

$$P(|Y - \mu_Y| \ge a) \le \frac{\mathrm{Var}(Y)}{a^2}.$$

PROOF Set X = (Y − μ_Y)^2. Then X is a nonnegative random variable. Thus, using 
Theorem 3.6.1, we have P(|Y − μ_Y| ≥ a) = P(X ≥ a^2) ≤ E(X)/a^2 = Var(Y)/a^2, 
and this establishes the result. ∎ 




Intuitively, Chebychev’s inequality says that if the variance of Y is small, then it 
is unlikely that Y will be too far from its mean value μ_Y. We now consider some 
examples. 


EXAMPLE 3.6.4 

Suppose again that P(X = 3) = 1/2, P(X = 4) = 1/3, and P(X = 7) = 1/6. 
Then E(X) = 4, as above. Also, E(X^2) = 9(1/2) + 16(1/3) + 49(1/6) = 18, so 
that Var(X) = 18 − 4^2 = 2. Hence, setting a = 1, Chebychev’s inequality says 
that P(|X − 4| ≥ 1) ≤ 2/1^2 = 2, which tells us nothing because we always have 
P(|X − 4| ≥ 1) ≤ 1. On the other hand, setting a = 3, we get P(|X − 4| ≥ 3) ≤ 
2/3^2 = 2/9, which is true because in fact P(|X − 4| ≥ 3) = P(X = 7) = 1/6 < 2/9. 
∎ 
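The two choices of a in Example 3.6.4 can be compared with a short computation. This is a sketch using exact fractions; `chebyshev_check` is an illustrative name, not something defined in the text.

```python
from fractions import Fraction

pmf = {3: Fraction(1, 2), 4: Fraction(1, 3), 7: Fraction(1, 6)}


def chebyshev_check(pmf, a):
    """Return (P(|X - mean| >= a), Var(X)/a^2) for a finite discrete X."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    tail = sum(p for x, p in pmf.items() if abs(x - mean) >= a)
    return tail, var / a**2


for a in (1, 3):
    tail, bound = chebyshev_check(pmf, a)
    print("a =", a, "exact:", tail, "bound:", bound)
```

For a = 1 the bound exceeds 1 and so carries no information, while for a = 3 it is close to the exact probability.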


EXAMPLE 3.6.5 

Let X ~ Exponential(3). Then E(X) = 1/3 and Var(X) = 1/9. Hence, 
by Chebychev’s inequality with a = 1/2, P(|X − 1/3| ≥ 1/2) ≤ (1/9)/(1/2)^2 = 4/9. 
On the other hand, because X ≥ 0, P(|X − 1/3| ≥ 1/2) = P(X ≥ 5/6), and 
by Markov’s inequality, P(X ≥ 5/6) ≤ (1/3)/(5/6) = 2/5. Because 2/5 < 4/9, we 
actually get a better bound from Markov’s inequality than from Chebychev’s inequality 
in this case. ∎ 
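Both bounds in Example 3.6.5 can be set against the exact tail probability. The sketch below assumes the rate parameterization used in the example (mean 1/3 for Exponential(3)), under which P(X ≥ t) = e^{−3t} for t ≥ 0; the variable names are illustrative.

```python
import math

rate = 3.0                    # X ~ Exponential(3), so E(X) = 1/3
mean = 1.0 / rate
var = 1.0 / rate**2
a = 0.5                       # deviation used in Chebychev's inequality

chebyshev_bound = var / a**2  # bound on P(|X - 1/3| >= 1/2)
threshold = mean + a          # since X >= 0, the event reduces to X >= 5/6
markov_bound = mean / threshold
exact_tail = math.exp(-rate * threshold)

print("Chebychev bound:", chebyshev_bound)
print("Markov bound:   ", markov_bound)
print("Exact value:    ", exact_tail)
```

Both bounds are valid but far from tight here: the exact probability is e^{−2.5}, well below either one.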


EXAMPLE 3.6.6 
Let Z ~ N(0, 1), and a = 5. Then by Chebychev’s inequality, P(|Z| ≥ 5) ≤ 1/5^2 = 1/25. ∎ 


EXAMPLE 3.6.7 

Let X be a random variable having very small variance. Then Chebychev’s inequality 
says that P(|X − μ_X| ≥ a) is small whenever a is not too small. In other words, usually 
|X − μ_X| is very small, i.e., X ≈ μ_X. This makes sense, because if the variance of X 
is very small, then usually X is very close to its mean value μ_X. ∎ 


Inequalities are also useful for covariances, as follows. 


Theorem 3.6.3 (Cauchy–Schwartz inequality) Let X and Y be arbitrary random 
variables, each having finite, nonzero variance. Then 

$$|\mathrm{Cov}(X, Y)| \le \sqrt{\mathrm{Var}(X) \, \mathrm{Var}(Y)}.$$

Furthermore, if Var(Y) > 0, then equality is attained if and only if X − μ_X = 
λ(Y − μ_Y) where λ = Cov(X, Y)/Var(Y). 


PROOF See Section 3.8 for the proof. ∎ 


The Cauchy—Schwartz inequality says that if the variance of X or Y is small, then 
the covariance of X and Y must also be small. 


EXAMPLE 3.6.8 

Suppose X = C is a constant. Then Var(X) = 0. It follows from the Cauchy– 
Schwartz inequality that, for any random variable Y, we must have |Cov(X, Y)| ≤ 
(Var(X) Var(Y))^{1/2} = (0 · Var(Y))^{1/2} = 0, so that Cov(X, Y) = 0. ∎ 




Recalling that the correlation of X and Y is defined by 

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X) \, \mathrm{Var}(Y)}},$$

we immediately obtain the following important result (which has already been referred 
to, back when correlation was first introduced). 


Corollary 3.6.1 Let X and Y be arbitrary random variables, having finite means 
and finite, nonzero variances. Then |Corr(X, Y)| ≤ 1. Furthermore, |Corr(X, Y)| = 
1 if and only if 

$$X - \mu_X = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)} \, (Y - \mu_Y).$$


So the correlation between two random variables is always between —1 and 1. We 
also see that X and Y are linearly related if and only if |Corr(X, Y)| = 1, and that 
this relationship is increasing (positive slope) when Corr(X, Y) = 1 and decreasing 
(negative slope) when Corr(X, Y) = —1. 


3.6.1 Jensen’s Inequality (Advanced) 


Finally, we develop a more advanced inequality that is sometimes very useful. A 
function f is called convex if for every x < y, the line segment from (x, f(x)) to (y, f(y)) 
lies entirely above the graph of f, as depicted in Figure 3.6.1. 


Figure 3.6.1: Plot of the convex function f(x) = x^4 and the line segment joining (2, f(2)) to 
(4, f(4)). 


In symbols, we require that for every x < y and every 0 < λ < 1, we have 
λ f(x) + (1 − λ) f(y) ≥ f(λx + (1 − λ)y). Examples of convex functions include 
f(x) = x^2, f(x) = x^4, and f(x) = max(x, C) for any real number C. We have the 
following. 


Theorem 3.6.4 (Jensen’s inequality) Let X be an arbitrary random variable, and let 
f : R^1 → R^1 be a convex function such that E(f(X)) is finite. Then f(E(X)) ≤ 
E(f(X)). Equality occurs if and only if f(X) = a + bX for some a and b. 




PROOF Because f is convex, we can find a linear function g(x) = ax + b such 
that g(E(X)) = f(E(X)) and g(x) ≤ f(x) for all x ∈ R^1 (see, for example, Figure 
3.6.2). 


Figure 3.6.2: Plot of the convex function f(x) = x^4 and the function 
g(x) = 81 + 108(x − 3), satisfying g(x) ≤ f(x) on the interval (2, 4). 


But then using monotonicity and linearity, we have E(f(X)) ≥ E(g(X)) = 
E(aX + b) = aE(X) + b = g(E(X)) = f(E(X)), as claimed. 

We have equality if and only if 0 = E(f(X) − g(X)). Because f(X) − g(X) ≥ 0, 
this occurs (using Challenge 3.3.29) if and only if f(X) = g(X) = aX + b with 
probability 1. ∎ 

EXAMPLE 3.6.9 

Let X be a random variable with finite variance. Then setting f(x) = x^2, Jensen’s 
inequality says that E(X^2) ≥ (E(X))^2. Of course, we already knew this because 
E(X^2) − (E(X))^2 = Var(X) ≥ 0. ∎ 

EXAMPLE 3.6.10 

Let X be a random variable with finite fourth moment. Then setting f(x) = x^4, 
Jensen’s inequality says that E(X^4) ≥ (E(X))^4. ∎ 


EXAMPLE 3.6.11 

Let X be a random variable with finite mean, and let M ∈ R^1. Then setting f(x) = 
max(x, M), we have that E(max(X, M)) ≥ max(E(X), M) by Jensen’s inequality. In 
fact, we could also have deduced this from the monotonicity property of expectation, 
using the two inequalities max(X, M) ≥ X and max(X, M) ≥ M. ∎ 
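The three convex functions from Examples 3.6.9 through 3.6.11 can be tried on a concrete distribution. The sketch below reuses the distribution of Example 3.6.1; the value M = 5 and the helper names are hypothetical choices made for illustration.

```python
from fractions import Fraction

# Distribution from Example 3.6.1, reused here for illustration.
pmf = {3: Fraction(1, 2), 4: Fraction(1, 3), 7: Fraction(1, 6)}

M = 5  # hypothetical constant for the function max(x, M)


def expect(func, pmf):
    """E(func(X)) for a finite discrete distribution."""
    return sum(func(x) * p for x, p in pmf.items())


convex_functions = {
    "square": lambda x: x**2,
    "fourth power": lambda x: x**4,
    "max(x, M)": lambda x: max(x, M),
}

mean = expect(lambda x: x, pmf)
# For each f, record the pair (f(E(X)), E(f(X))).
jensen_pairs = {
    name: (f(mean), expect(f, pmf)) for name, f in convex_functions.items()
}

for name, (lhs, rhs) in jensen_pairs.items():
    print(name, "f(E(X)) =", lhs, "E(f(X)) =", rhs)
```

In every pair the first entry is no larger than the second, as Jensen's inequality requires.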


Summary of Section 3.6 


• For nonnegative X, Markov’s inequality says P(X ≥ a) ≤ E(X)/a. 

• Chebychev’s inequality says P(|Y − μ_Y| ≥ a) ≤ Var(Y)/a^2. 

• The Cauchy–Schwartz inequality says |Cov(X, Y)| ≤ (Var(X) Var(Y))^{1/2}, so 
that |Corr(X, Y)| ≤ 1. 

• Jensen’s inequality says f(E(X)) ≤ E(f(X)) whenever f is convex. 


EXERCISES 


3.6.1 Let Z ~ Poisson(3). Use Markov’s inequality to get an upper bound on P(Z ≥ 7). 

3.6.2 Let X ~ Exponential(5). Use Markov’s inequality to get an upper bound on 
P(X ≥ 3) and compare it with the precise value. 

3.6.3 Let X ~ Geometric(1/2). 

(a) Use Markov’s inequality to get an upper bound on P(X ≥ 9). 

(b) Use Markov’s inequality to get an upper bound on P(X ≥ 2). 

(c) Use Chebychev’s inequality to get an upper bound on P(|X − 1| ≥ 1). 

(d) Compare the answers obtained in parts (b) and (c). 

3.6.4 Let Z ~ N(5, 9). Use Chebychev’s inequality to get an upper bound on 
P(|Z − 5| ≥ 30). 

3.6.5 Let W ~ Binomial(100, 1/2), as in the number of heads when flipping 100 fair 
coins. Use Chebychev’s inequality to get an upper bound on P(|W − 50| ≥ 10). 

3.6.6 Let Y ~ N(0, 100), and let Z ~ Binomial(80, 1/4). Determine (with 
explanation) the largest and smallest possible values of Cov(Y, Z). 

3.6.7 Let X ~ Geometric(1/11). Use Jensen’s inequality to determine a lower bound 
on E(X^4), in two different ways. 

(a) Apply Jensen’s inequality to X with f(x) = x^4. 

(b) Apply Jensen’s inequality to X^2 with f(x) = x^2. 

3.6.8 Let X be the number showing on a fair six-sided die. What bound does 
Chebychev’s inequality give for P(X ≥ 5 or X ≤ 2)? 

3.6.9 Suppose you flip four fair coins. Let Y be the number of heads obtained. 

(a) What bound does Chebychev’s inequality give for P(Y ≥ 3 or Y ≤ 1)? 

(b) What bound does Chebychev’s inequality give for P(Y ≥ 4 or Y ≤ 0)? 

3.6.10 Suppose W has density function f(w) = 3w^2 for 0 < w < 1, otherwise 
f(w) = 0. 

(a) Compute E(W). 

(b) What bound does Chebychev’s inequality give for P(|W − E(W)| ≥ 1/4)? 

3.6.11 Suppose Z has density function f(z) = z^3/4 for 0 < z < 2, otherwise f(z) = 0. 

(a) Compute E(Z). 

(b) What bound does Chebychev’s inequality give for P(|Z − E(Z)| ≥ 1/2)? 

3.6.12 Suppose Var(X) = 4 and Var(Y) = 9. 

(a) What is the largest possible value of Cov(X, Y)? 

(b) What is the smallest possible value of Cov(X, Y)? 

(c) Suppose Z = 3X/2. Compute Var(Z) and Cov(X, Z), and compare your answer 
with part (a). 

(d) Suppose W = −3X/2. Compute Var(W) and Cov(W, Z), and compare your 
answer with part (b). 




3.6.13 Suppose a species of beetle has length 35 millimeters on average. Find an upper 
bound on the probability that a randomly chosen beetle of this species will be over 80 
millimeters long. 


PROBLEMS 


3.6.14 Prove that for any ε > 0 and δ > 0, there is a positive integer M, such that if X 
is the number of heads when flipping M fair coins, then P(|(X/M) − (1/2)| ≥ δ) ≤ ε. 


3.6.15 Prove that for any μ and σ^2 > 0, there is a > 0 and a random variable X with 
E(X) = μ and Var(X) = σ^2, such that Chebychev’s inequality holds with equality, 
i.e., such that P(|X − μ| ≥ a) = σ^2/a^2. 

3.6.16 Suppose that (X, Y) is uniform on the set {(x_1, y_1), ..., (x_n, y_n)} where the 
x_1, ..., x_n are distinct values and the y_1, ..., y_n are distinct values. 

(a) Prove that X is uniformly distributed on x_1, ..., x_n, with mean given by x̄ = 
n^{-1} Σ_{i=1}^n x_i and variance given by ŝ_X^2 = n^{-1} Σ_{i=1}^n (x_i − x̄)^2. 

(b) Prove that the correlation coefficient between X and Y is given by 

$$r_{XY} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \, \sum_{i=1}^n (y_i - \bar{y})^2}} = \frac{\hat{s}_{XY}}{\hat{s}_X \hat{s}_Y},$$

where ŝ_XY = n^{-1} Σ_{i=1}^n (x_i − x̄)(y_i − ȳ). The value ŝ_XY is referred to as the sample 
covariance and r_XY is referred to as the sample correlation coefficient when the values 
(x_1, y_1), ..., (x_n, y_n) are an observed sample from some bivariate distribution. 


(c) Argue that r_XY is also the correlation coefficient between X and Y when we drop 
the assumption of distinctness for the x_i and y_i. 

(d) Prove that −1 ≤ r_XY ≤ 1 and state the conditions under which r_XY = ±1. 

3.6.17 Suppose that X is uniformly distributed on {x_1, ..., x_n} and so has mean x̄ = 
n^{-1} Σ_{i=1}^n x_i and variance ŝ_X^2 = n^{-1} Σ_{i=1}^n (x_i − x̄)^2 (see Problem 3.6.16(a)). What is 
the largest proportion of the values x_i that can lie outside (x̄ − 2ŝ_X, x̄ + 2ŝ_X)? 

3.6.18 Suppose that X is distributed with density given by f_X(x) = 2/x^3 for x > 1 
and is 0 otherwise. 

(a) Prove that f_X is a density. 

(b) Calculate the mean of X. 

(c) Compute P(X ≥ k) and compare this with the upper bound on this quantity given 
by Markov’s inequality. 

(d) What does Chebyshev’s inequality say in this case? 

3.6.19 Let g(x) = max(—x, —10). 

(a) Verify that g is a convex function. 

(b) Suppose Z ~ Exponential(5). Use Jensen’s inequality to obtain a lower bound on 
E(g(Z)). 

3.6.20 It can be shown that a function f, with continuous second derivative, is convex 
on (a, b) if f''(x) ≥ 0 for all x ∈ (a, b). 

(a) Use the above fact to show that f(x) = x^p is convex on (0, ∞) whenever p ≥ 1. 

(b) Use part (a) to prove that (E(|X|^p))^{1/p} ≥ |E(X)| whenever p ≥ 1. 
(c) Prove that Var(X) = 0 if and only if X is degenerate at a constant c. 


CHALLENGES 


3.6.21 Determine (with proof) all functions that are convex and whose negatives are 
also convex. (That is, find all functions f such that f is convex, and also — f is 
convex.) 


3.7 General Expectations (Advanced) 


So far we have considered expected values separately for discrete and absolutely con- 
tinuous random variables only. However, this separation into two different “cases” may 
seem unnatural. Furthermore, we know that some random variables are neither discrete 
nor continuous — for example, mixtures of discrete and continuous distributions. 

Hence, it seems desirable to have a more general definition of expected value. Such 
generality is normally considered in the context of general measure theory, an advanced 
mathematical subject. However, it is also possible to give a general definition in ele- 
mentary terms, as follows. 


Definition 3.7.1 Let X be an arbitrary random variable (perhaps neither discrete 
nor continuous). Then the expected value of X is given by 

$$E(X) = \int_0^{\infty} P(X > t) \, dt - \int_{-\infty}^{0} P(X < t) \, dt,$$

provided either ∫_0^∞ P(X > t) dt < ∞ or ∫_{−∞}^0 P(X < t) dt < ∞. 


This definition appears to contradict our previous definitions of E (X). However, in 
fact, there is no contradiction, as the following theorem shows. 


Theorem 3.7.1 

(a) Let X be a discrete random variable with distinct possible values x_1, x_2, ..., 
and put p_i = P(X = x_i). Then Definition 3.7.1 agrees with the previous definition 
of E(X). That is, 

$$\int_0^{\infty} P(X > t) \, dt - \int_{-\infty}^{0} P(X < t) \, dt = \sum_i x_i \, p_i.$$

(b) Let X be an absolutely continuous random variable with density f_X. Then 
Definition 3.7.1 agrees with the previous definition of E(X). That is, 

$$\int_0^{\infty} P(X > t) \, dt - \int_{-\infty}^{0} P(X < t) \, dt = \int_{-\infty}^{\infty} x \, f_X(x) \, dx.$$


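PROOF The key to the proof is switching the order of the integration/summation. 

(a) We have (writing out the case in which every x_i ≥ 0; negative values are handled 
in the same way through the second integral) 

$$\int_0^{\infty} P(X > t) \, dt = \int_0^{\infty} \sum_{i : \, x_i > t} p_i \, dt = \sum_i p_i \int_0^{x_i} dt = \sum_i p_i \, x_i,$$

as claimed. 
(b) We have 

$$\int_0^{\infty} P(X > t) \, dt = \int_0^{\infty} \left( \int_t^{\infty} f_X(x) \, dx \right) dt = \int_0^{\infty} \left( \int_0^{x} f_X(x) \, dt \right) dx = \int_0^{\infty} x \, f_X(x) \, dx.$$

Similarly, 

$$\int_{-\infty}^{0} P(X < t) \, dt = \int_{-\infty}^{0} \left( \int_{-\infty}^{t} f_X(x) \, dx \right) dt = \int_{-\infty}^{0} \left( \int_x^{0} f_X(x) \, dt \right) dx = \int_{-\infty}^{0} (-x) \, f_X(x) \, dx.$$

Hence, 

$$\int_0^{\infty} P(X > t) \, dt - \int_{-\infty}^{0} P(X < t) \, dt = \int_0^{\infty} x \, f_X(x) \, dx - \int_{-\infty}^{0} (-x) \, f_X(x) \, dx = \int_{-\infty}^{\infty} x \, f_X(x) \, dx,$$

as claimed. ∎ 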


In other words, Theorem 3.7.1 says that Definition 3.7.1 includes our previous defi- 
nitions of expected value, for both discrete and absolutely continuous random variables, 
while working for any random variable at all. (Note that to apply Definition 3.7.1 we 
take an integral, not a sum, regardless of whether X is discrete or continuous!) 
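For a discrete X with finitely many values, the two integrals in Definition 3.7.1 are integrals of step functions and can be evaluated exactly. The following sketch does this and compares the result with the usual sum; the function name and the example distribution are hypothetical, chosen so that both integrals contribute.

```python
from fractions import Fraction


def tail_integral_expectation(pmf):
    """E(X) via Definition 3.7.1 for a discrete X with finitely many values.

    P(X > t) is constant between consecutive positive support points, and
    P(X < t) is constant between consecutive negative ones, so each integral
    reduces to a finite sum of (interval width) * (constant probability).
    """
    positive = sorted(x for x in pmf if x > 0)
    negative = sorted((x for x in pmf if x < 0), reverse=True)

    upper = Fraction(0)  # integral of P(X > t) over t in (0, infinity)
    previous = 0
    for b in positive:
        # For t in [previous, b), the event {X > t} equals {X >= b}.
        upper += (b - previous) * sum(p for x, p in pmf.items() if x >= b)
        previous = b

    lower = Fraction(0)  # integral of P(X < t) over t in (-infinity, 0)
    previous = 0
    for c in negative:
        # For t in (c, previous], the event {X < t} equals {X <= c}.
        lower += (previous - c) * sum(p for x, p in pmf.items() if x <= c)
        previous = c

    return upper - lower


# Hypothetical distribution with both negative and positive values.
pmf = {
    -2: Fraction(1, 4),
    0: Fraction(1, 4),
    1: Fraction(1, 6),
    3: Fraction(1, 3),
}

general_value = tail_integral_expectation(pmf)
usual_value = sum(x * p for x, p in pmf.items())
print(general_value, usual_value)
```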

Furthermore, Definition 3.7.1 preserves the key properties of expected value, as 
the following theorem shows. (We omit the proof here, but see Challenge 3.7.10 for a 
proof of part (c).) 


Theorem 3.7.2 Let X and Y be arbitrary random variables, perhaps neither discrete 
nor continuous, with expected values defined by Definition 3.7.1. 
(a) (Linearity) If a and b are any real numbers, then E(aX + bY) = aE(X) + bE(Y). 
(b) If X and Y are independent, then E(XY) = E(X) E(Y). 
(c) (Monotonicity) If X ≤ Y, then E(X) ≤ E(Y). 


Definition 3.7.1 also tells us about expected values of mixture distributions, as fol- 
lows. 




Theorem 3.7.3 For 1 ≤ i ≤ k, let Y_i be a random variable with cdf F_i. Let X be 
a random variable whose cdf corresponds to a finite mixture (as in Section 2.5.4) 
of the cdfs of the Y_i, so that F_X(x) = Σ_i p_i F_i(x), where p_i ≥ 0 and Σ_i p_i = 1. 
Then E(X) = Σ_i p_i E(Y_i). 


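PROOF We compute that 

$$P(X > t) = 1 - F_X(t) = 1 - \sum_i p_i F_i(t) = \sum_i p_i (1 - F_i(t)) = \sum_i p_i \, P(Y_i > t).$$

Similarly, 

$$P(X < t) = F_X(t^-) = \sum_i p_i F_i(t^-) = \sum_i p_i \, P(Y_i < t).$$

Hence, from Definition 3.7.1, 

$$\begin{aligned} E(X) &= \int_0^{\infty} P(X > t) \, dt - \int_{-\infty}^{0} P(X < t) \, dt \\ &= \int_0^{\infty} \sum_i p_i \, P(Y_i > t) \, dt - \int_{-\infty}^{0} \sum_i p_i \, P(Y_i < t) \, dt \\ &= \sum_i p_i \left( \int_0^{\infty} P(Y_i > t) \, dt - \int_{-\infty}^{0} P(Y_i < t) \, dt \right) \\ &= \sum_i p_i \, E(Y_i), \end{aligned}$$

as claimed. ∎ 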
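Theorem 3.7.3 can be illustrated with a mixture of two discrete distributions, for which mixing the cdfs is the same as mixing the probability functions. The components, weights, and helper names below are hypothetical and serve only as an example.

```python
from fractions import Fraction


def mean(pmf):
    """Expected value of a finite discrete distribution."""
    return sum(v * p for v, p in pmf.items())


def mixture_pmf(weights, components):
    """Probability function of the mixture sum_i p_i * (distribution i)."""
    mixed = {}
    for w, comp in zip(weights, components):
        for v, p in comp.items():
            mixed[v] = mixed.get(v, Fraction(0)) + w * p
    return mixed


# Hypothetical components: uniform on {1, 2, 3, 4} and uniform on {10, 20}.
y1 = {v: Fraction(1, 4) for v in (1, 2, 3, 4)}
y2 = {v: Fraction(1, 2) for v in (10, 20)}
weights = (Fraction(1, 4), Fraction(3, 4))

mixed = mixture_pmf(weights, (y1, y2))
mean_of_mixture = mean(mixed)
weighted_means = sum(w * mean(c) for w, c in zip(weights, (y1, y2)))
print(mean_of_mixture, weighted_means)
```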


Summary of Section 3.7 


• For general random variables, we can define a general expected value by E(X) = 
∫_0^∞ P(X > t) dt − ∫_{−∞}^0 P(X < t) dt. 

• This definition agrees with our previous one, for discrete or absolutely 
continuous random variables. 

• General expectation is still linear and monotone. 


EXERCISES 

3.7.1 Let X_1, X_2, and Y be as in Example 2.5.6, so that Y is a mixture of X_1 and X_2. 
Compute E (X1), E (X2), and E (Y). 

3.7.2 Suppose we roll a fair six-sided die. If it comes up 1, then we roll the same die 
again and let X be the value showing. If it comes up anything other than 1, then we 
instead roll a fair eight-sided die (with the sides numbered 1 through 8), and let X be 
the value showing on the eight-sided die. Compute the expected value of X. 

3.7.3 Let X be a positive constant random variable, so that X = C for some constant 
C > 0. Prove directly from Definition 3.7.1 that E (X) = C. 


3.7.4 Let Z be a general random variable (perhaps neither discrete nor continuous), 
and suppose that P(Z < 100) = 1. Prove directly from Definition 3.7.1 that E (Z) < 
100. 

3.7.5 Suppose we are told only that P(X > x) = 1/x? for x > 1, and P(X > x) =1 
for x < 1, but we are not told if X is discrete or continuous or neither. Compute E (X). 
3.7.6 Suppose P(Z > z) = 1 for z < 5, P(Z > z) = (8 − z)/3 for 5 < z < 8, and 
P(Z > z) = 0 for z > 8. Compute E(Z). 

3.7.7 Suppose P(W > w) = e>” for w > O and P(W > w) = 1 fr w < 0. 
Compute E(W). 

3.7.8 Suppose P(Y > y) = e^{−y^2/2} for y > 0 and P(Y > y) = 1 for y < 0. Compute 
E(Y). (Hint: The density of a standard normal might help you solve the integral.) 


3.7.9 Suppose the cdf of W is given by F_W(w) = 0 for w < 10, F_W(w) = w − 10 
for 10 ≤ w ≤ 11, and by F_W(w) = 1 for w > 11. Compute E(W). (Hint: Remember 
that F_W(w) = P(W ≤ w) = 1 − P(W > w).) 


CHALLENGES 


3.7.10 Prove part (c) of Theorem 3.7.2. (Hint: If X ≤ Y, then how does the event 
{X > t} compare to the event {Y > t}? Hence, how does P(X > t) compare to 
P(Y > t)? And what about {X < t} and {Y < t}?) 


3.8 Further Proofs (Advanced) 

Proof of Theorem 3.4.7 


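We want to prove that if S has a compound distribution as in (3.4.2), then (a) $E(S) =
E(X_1)E(N)$ and (b) $m_S(s) = r_N(m_{X_1}(s))$.

Because the $\{X_i\}$ are i.i.d., we have $E(X_i) = E(X_1)$ for all i. Define $I_i$ by $I_i =
I_{\{1,\ldots,N\}}(i)$. Then we can write $S = \sum_{i=1}^{\infty} X_i I_i$. Also note that
$\sum_{i=1}^{\infty} I_i = N$. Because N is independent of $X_i$, so is $I_i$, and we have

$$
\begin{aligned}
E(S) &= E\Big(\sum_{i=1}^{\infty} X_i I_i\Big) = \sum_{i=1}^{\infty} E(X_i I_i) \\
&= \sum_{i=1}^{\infty} E(X_i)E(I_i) = \sum_{i=1}^{\infty} E(X_1)E(I_i) \\
&= E(X_1)\sum_{i=1}^{\infty} E(I_i) = E(X_1)\,E\Big(\sum_{i=1}^{\infty} I_i\Big) \\
&= E(X_1)E(N).
\end{aligned}
$$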




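This proves part (a).

Now, using an expectation version of the law of total probability (see Theorem
3.5.3), and recalling that $E\big(\exp\big(\sum_{i=1}^{n} sX_i\big)\big) = m_{X_1}(s)^n$ because the
$\{X_i\}$ are i.i.d., we compute that

$$
\begin{aligned}
m_S(s) &= E\Big(\exp\Big(\sum_{i=1}^{N} sX_i\Big)\Big)
 = \sum_{n=0}^{\infty} P(N = n)\,E\Big(\exp\Big(\sum_{i=1}^{n} sX_i\Big) \,\Big|\, N = n\Big) \\
&= \sum_{n=0}^{\infty} P(N = n)\,E\Big(\exp\Big(\sum_{i=1}^{n} sX_i\Big)\Big)
 = \sum_{n=0}^{\infty} P(N = n)\,m_{X_1}(s)^n \\
&= E\big(m_{X_1}(s)^N\big) = r_N(m_{X_1}(s)),
\end{aligned}
$$

thus proving part (b). ∎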
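
Part (a) can be checked numerically. The following is only a simulation sketch under illustrative assumptions that are not part of the theorem: N is taken uniform on {0, 1, ..., 10} (so E(N) = 5), the $X_i$ are taken Exponential(2) (so $E(X_1) = 1/2$), and the seed and number of trials are arbitrary. The function name `simulate_compound_mean` is hypothetical.

```python
import random


def simulate_compound_mean(num_trials, seed=0):
    """Monte Carlo estimate of E(S) for S = X_1 + ... + X_N (illustrative choices)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_trials):
        n = rng.randint(0, 10)  # assumed N: uniform on {0,...,10}, so E(N) = 5
        # assumed X_i: Exponential with rate 2, so E(X_1) = 1/2; N is drawn independently
        total += sum(rng.expovariate(2.0) for _ in range(n))
    return total / num_trials


estimate = simulate_compound_mean(100_000)
print(estimate)  # Theorem 3.4.7(a) predicts a value near E(X_1)E(N) = 2.5
```

The estimate should differ from 2.5 only by Monte Carlo error, which shrinks as the number of trials grows.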


Proof of Theorem 3.5.3 


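We want to show that when X and Y are random variables, and $g : R^1 \to R^1$ is any
function, then $E(g(Y)\,E(X \mid Y)) = E(g(Y)\,X)$.

If X and Y are discrete, then

$$
\begin{aligned}
E(g(Y)E(X \mid Y)) &= \sum_{y \in R^1} g(y)\,E(X \mid Y = y)\,P(Y = y) \\
&= \sum_{y \in R^1} g(y)\Big(\sum_{x \in R^1} x\,P(X = x \mid Y = y)\Big)P(Y = y) \\
&= \sum_{y \in R^1} g(y)\Big(\sum_{x \in R^1} x\,\frac{P(X = x,\, Y = y)}{P(Y = y)}\Big)P(Y = y) \\
&= \sum_{x \in R^1}\sum_{y \in R^1} g(y)\,x\,P(X = x,\, Y = y) = E(g(Y)X),
\end{aligned}
$$

as claimed.

Similarly, if X and Y are jointly absolutely continuous, then

$$
\begin{aligned}
E(g(Y)E(X \mid Y)) &= \int_{-\infty}^{\infty} g(y)\,E(X \mid Y = y)\,f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} g(y)\Big(\int_{-\infty}^{\infty} x\,f_{X|Y}(x \mid y)\,dx\Big)f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} g(y)\Big(\int_{-\infty}^{\infty} x\,\frac{f_{X,Y}(x, y)}{f_Y(y)}\,dx\Big)f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(y)\,x\,f_{X,Y}(x, y)\,dx\,dy = E(g(Y)X),
\end{aligned}
$$

as claimed. ∎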




Proof of Theorem 3.5.4 


We want to prove that, when X and Y are random variables, and $g : R^1 \to R^1$ is any
function, then $E(g(Y)X \mid Y) = g(Y)\,E(X \mid Y)$.

For simplicity, we assume X and Y are discrete; the jointly absolutely continuous
case is similar. Then for any y with P(Y = y) > 0,

$$
\begin{aligned}
E(g(Y)X \mid Y = y) &= \sum_{x \in R^1}\sum_{z \in R^1} g(z)\,x\,P(X = x,\, Y = z \mid Y = y) \\
&= \sum_{x \in R^1} g(y)\,x\,P(X = x \mid Y = y) \\
&= g(y)\sum_{x \in R^1} x\,P(X = x \mid Y = y) = g(y)\,E(X \mid Y = y).
\end{aligned}
$$

Because this is true for any y, we must have $E(g(Y)X \mid Y) = g(Y)E(X \mid Y)$, as
claimed. ∎


Proof of Theorem 3.5.6 


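We need to show that for random variables X and Y, $\mathrm{Var}(X) = \mathrm{Var}(E(X \mid Y)) +
E(\mathrm{Var}(X \mid Y))$.

Using Theorem 3.5.2, we have that

$$\mathrm{Var}(X) = E((X - \mu_X)^2) = E\big(E((X - \mu_X)^2 \mid Y)\big). \qquad (3.8.1)$$

Now,

$$
\begin{aligned}
(X - \mu_X)^2 &= (X - E(X \mid Y) + E(X \mid Y) - \mu_X)^2 \\
&= (X - E(X \mid Y))^2 + (E(X \mid Y) - \mu_X)^2 \\
&\quad + 2\,(X - E(X \mid Y))\,(E(X \mid Y) - \mu_X). \qquad (3.8.2)
\end{aligned}
$$

But $E\big((X - E(X \mid Y))^2 \mid Y\big) = \mathrm{Var}(X \mid Y)$.

Also, again using Theorem 3.5.2,

$$E\big(E((E(X \mid Y) - \mu_X)^2 \mid Y)\big) = E\big((E(X \mid Y) - \mu_X)^2\big) = \mathrm{Var}(E(X \mid Y)).$$

Finally, using Theorem 3.5.4 and linearity (Theorem 3.5.1), we see that

$$
\begin{aligned}
E\big((X - E(X \mid Y))(E(X \mid Y) - \mu_X) \mid Y\big)
&= (E(X \mid Y) - \mu_X)\,E\big(X - E(X \mid Y) \mid Y\big) \\
&= (E(X \mid Y) - \mu_X)\,\big(E(X \mid Y) - E(E(X \mid Y) \mid Y)\big) \\
&= (E(X \mid Y) - \mu_X)\,\big(E(X \mid Y) - E(X \mid Y)\big) = 0.
\end{aligned}
$$

From (3.8.1), (3.8.2), and linearity, we have that $\mathrm{Var}(X) = E(\mathrm{Var}(X \mid Y))
+ \mathrm{Var}(E(X \mid Y)) + 0$, which completes the proof. ∎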
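
The decomposition in Theorem 3.5.6 can be verified exactly on a small example. The sketch below uses a hypothetical joint probability function, chosen only for illustration, and exact rational arithmetic; the helper names `expect`, `cond_mean`, and `cond_var` are assumptions introduced here.

```python
from fractions import Fraction as F

# Hypothetical joint probability function P(X = x, Y = y), for illustration only.
joint = {(0, 0): F(1, 8), (1, 0): F(3, 8), (0, 1): F(1, 4), (2, 1): F(1, 4)}


def expect(fn):
    """E(fn(X, Y)) under the joint probability function above."""
    return sum(p * fn(x, y) for (x, y), p in joint.items())


mu_x = expect(lambda x, y: x)
var_x = expect(lambda x, y: (x - mu_x) ** 2)

# Marginal probability function of Y.
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0) + p


def cond_mean(y):
    """E(X | Y = y)."""
    return sum(p * x for (x, yy), p in joint.items() if yy == y) / p_y[y]


def cond_var(y):
    """Var(X | Y = y)."""
    m = cond_mean(y)
    return sum(p * (x - m) ** 2 for (x, yy), p in joint.items() if yy == y) / p_y[y]


var_of_cond_mean = sum(p_y[y] * (cond_mean(y) - mu_x) ** 2 for y in p_y)
mean_of_cond_var = sum(p_y[y] * cond_var(y) for y in p_y)

print(var_x, var_of_cond_mean, mean_of_cond_var)  # 39/64 1/64 19/32
```

For this joint probability function, Var(X) = 39/64 splits as 1/64 + 38/64, in agreement with the theorem.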




Proof of Theorem 3.6.3 (Cauchy–Schwarz inequality)


We will prove that whenever X and Y are arbitrary random variables, each having
finite, nonzero variance, then

$$|\mathrm{Cov}(X, Y)| \le \sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}.$$

Furthermore, if Var(Y) > 0, then equality is attained if and only if $X - \mu_X = \lambda(Y -
\mu_Y)$ where $\lambda = \mathrm{Cov}(X, Y)/\mathrm{Var}(Y)$.

If Var(Y) = 0, then Challenge 3.3.30 implies that $Y = \mu_Y$ with probability 1
(because $\mathrm{Var}(Y) = E((Y - \mu_Y)^2)$ and $(Y - \mu_Y)^2 \ge 0$). This implies that

$$\mathrm{Cov}(X, Y) = E\big((X - \mu_X)(\mu_Y - \mu_Y)\big) = 0 = \sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)},$$

and the Cauchy–Schwarz inequality holds.

If $\mathrm{Var}(Y) \ne 0$, let $Z = X - \mu_X$ and $W = Y - \mu_Y$. Then for any real number $\lambda$,
we compute, using linearity, that

$$
\begin{aligned}
E\big((Z - \lambda W)^2\big) &= E(Z^2) - 2\lambda E(ZW) + \lambda^2 E(W^2) \\
&= \mathrm{Var}(X) - 2\lambda\,\mathrm{Cov}(X, Y) + \lambda^2\,\mathrm{Var}(Y) \\
&= a\lambda^2 + b\lambda + c,
\end{aligned}
$$

where $a = \mathrm{Var}(Y) > 0$, $b = -2\,\mathrm{Cov}(X, Y)$, and $c = \mathrm{Var}(X)$. On the other hand,
clearly $E((Z - \lambda W)^2) \ge 0$ for all $\lambda$. Hence, we have a quadratic in $\lambda$ that is
always nonnegative, and so has at most one real root.

By the quadratic formula, any quadratic equation has two real roots provided that
the discriminant $b^2 - 4ac > 0$. Because that is not the case here, we must have
$b^2 - 4ac \le 0$, i.e.,

$$4\,\mathrm{Cov}(X, Y)^2 - 4\,\mathrm{Var}(Y)\,\mathrm{Var}(X) \le 0.$$

Dividing by 4, rearranging, and taking square roots, we see that

$$|\mathrm{Cov}(X, Y)| \le (\mathrm{Var}(X)\,\mathrm{Var}(Y))^{1/2},$$

as claimed.

Finally, $|\mathrm{Cov}(X, Y)| = (\mathrm{Var}(X)\,\mathrm{Var}(Y))^{1/2}$ if and only if $b^2 - 4ac = 0$, which
means the quadratic has one real root. Thus, there is some real number $\lambda$ such that
$E((Z - \lambda W)^2) = 0$. Since $(Z - \lambda W)^2 \ge 0$, it follows from Challenge 3.3.29 that this
happens if and only if $Z - \lambda W = 0$ with probability 1, as claimed. When this is the
case, then

$$\mathrm{Cov}(X, Y) = E(ZW) = E(\lambda W^2) = \lambda E(W^2) = \lambda\,\mathrm{Var}(Y)$$

and so $\lambda = \mathrm{Cov}(X, Y)/\mathrm{Var}(Y)$ when $\mathrm{Var}(Y) \ne 0$. ∎
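
The quadratic argument in this proof can be traced on a concrete case. The sketch below reuses a hypothetical joint probability function (an assumption, chosen only for illustration) and exact rational arithmetic to compute the coefficients a, b, c, the discriminant, and the value of the quadratic at $\lambda = \mathrm{Cov}(X, Y)/\mathrm{Var}(Y)$; the names `expect` and `q` are introduced here.

```python
from fractions import Fraction as F

# Hypothetical joint probability function P(X = x, Y = y), for illustration only.
joint = {(0, 0): F(1, 8), (1, 0): F(3, 8), (0, 1): F(1, 4), (2, 1): F(1, 4)}


def expect(fn):
    """E(fn(X, Y)) under the joint probability function above."""
    return sum(p * fn(x, y) for (x, y), p in joint.items())


mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov = expect(lambda x, y: (x - mu_x) * (y - mu_y))

# Coefficients of the quadratic E((Z - lambda W)^2) = a lambda^2 + b lambda + c.
a, b, c = var_y, -2 * cov, var_x
discriminant = b * b - 4 * a * c


def q(lam):
    """E((Z - lam W)^2) computed directly, with Z = X - mu_X and W = Y - mu_Y."""
    return expect(lambda x, y: ((x - mu_x) - lam * (y - mu_y)) ** 2)


lam_star = cov / var_y  # the minimizing value of lambda
print(discriminant, q(lam_star))  # -19/32 19/32
```

Here the discriminant is strictly negative, so the inequality is strict, and the minimum of the quadratic equals $\mathrm{Var}(X) - \mathrm{Cov}(X, Y)^2/\mathrm{Var}(Y)$, which is positive.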


Chapter 4 


Sampling Distributions and 
Limits 


CHAPTER OUTLINE 


Section 1 Sampling Distributions 
Section 2 Convergence in Probability 
Section 3 Convergence with Probability 1 
Section 4 Convergence in Distribution 
Section 5 Monte Carlo Approximations 
Section 6 Normal Distribution Theory 
Section 7 Further Proofs (Advanced) 


In many applications of probability theory, we will be faced with the following problem.
Suppose that $X_1, X_2, \ldots, X_n$ is an identically and independently distributed
(i.i.d.) sequence, i.e., $X_1, X_2, \ldots, X_n$ is a sample from some distribution, and we
are interested in the distribution of a new random variable $Y = h(X_1, X_2, \ldots, X_n)$ for
some function h. In particular, we might want to compute the distribution function of
Y or perhaps its mean and variance. The distribution of Y is sometimes referred to as
its sampling distribution, as Y is based on a sample from some underlying distribution.

We will see that some of the methods developed in earlier chapters are useful in 
solving such problems — especially when it is possible to compute an exact solution, 
e.g., obtain an exact expression for the probability or density function of Y. Section 
4.6 contains a number of exact distribution results for a variety of functions of normal 
random variables. These have important applications in statistics. 

Quite often, however, exact results are impossible to obtain, as the problem is just
too complex. In such cases, we must develop an approximation to the distribution of
Y.

For many important problems, a version of Y is defined for each sample size n (e.g.,
a sample mean or sample variance), so that we can consider a sequence of random
variables $Y_1, Y_2, \ldots$, etc. This leads us to consider the limiting distribution of such
a sequence so that, when n is large, we can approximate the distribution of $Y_n$ by the
limit, which is often much simpler. This approach leads to a famous result, known as
the central limit theorem, discussed in Section 4.4.

Sometimes we cannot even develop useful approximations for large n, due to the 
difficulty of the problem or perhaps because n is just too small in a particular applica- 
tion. Fortunately, however, we can then use the Monte Carlo approach where the power 
of the computer becomes available. This is discussed in Section 4.5. 

In Chapter 5 we will see that, in statistical applications, we typically do not know
much about the underlying distribution of the $X_i$ from which we are sampling. We then
collect a sample and compute a value, such as Y, that will serve as an estimate of a
characteristic of the underlying distribution, e.g., the sample mean $\bar{X}$ will serve as an
estimate of the mean of the distribution of the $X_i$. We then want to know what happens to
these estimates as n grows. If we have chosen our estimates well, then the estimates will
converge to the quantities we are estimating as n increases. Such an estimate is called
consistent. In Sections 4.2 and 4.3, we will discuss the most important consistency
theorems, namely, the weak and strong laws of large numbers.


4.1 | Sampling Distributions 


Let us consider a very simple example. 


EXAMPLE 4.1.1 
Suppose we obtain a sample $X_1, X_2$ of size n = 2 from the discrete distribution with
probability function given by

$$
p_X(x) = \begin{cases} 1/2 & x = 1 \\ 1/4 & x = 2 \\ 1/4 & x = 3 \\ 0 & \text{otherwise.} \end{cases}
$$

Let us take $Y_2 = (X_1 X_2)^{1/2}$. This is the geometric mean of the sample values (the
geometric mean of n positive numbers $x_1, \ldots, x_n$ is defined as $(x_1 \cdots x_n)^{1/n}$).

To determine the distribution of $Y_2$, we first list the possible values for $Y_2$, the
samples that give rise to these values, and their probabilities of occurrence. The values
of these probabilities specify the sampling distribution of $Y_2$. We have the following
table.

y | Sample | $p_{Y_2}(y)$
1 | {(1, 1)} | (1/2)(1/2) = 1/4
√2 | {(1, 2), (2, 1)} | (1/2)(1/4) + (1/4)(1/2) = 1/4
√3 | {(1, 3), (3, 1)} | (1/2)(1/4) + (1/4)(1/2) = 1/4
2 | {(2, 2)} | (1/4)(1/4) = 1/16
√6 | {(2, 3), (3, 2)} | (1/4)(1/4) + (1/4)(1/4) = 1/8
3 | {(3, 3)} | (1/4)(1/4) = 1/16
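
The enumeration behind this table can be reproduced mechanically. The following is a minimal sketch, not the authors' method: it lists every sample of size n, multiplies the probabilities, and accumulates them by the value of the product $x_1 \cdots x_n$ (the n-th root is left symbolic to avoid floating-point keys). The function name `product_distribution` is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as F
from itertools import product

# Probability function p_X from Example 4.1.1.
p_x = {1: F(1, 2), 2: F(1, 4), 3: F(1, 4)}


def product_distribution(n):
    """Exact probability function of X_1 * ... * X_n; Y_n is the n-th root of this product."""
    dist = defaultdict(F)  # Fraction() is 0, so missing keys start at probability 0
    for sample in product(p_x, repeat=n):
        prob = F(1)
        value = 1
        for x in sample:
            prob *= p_x[x]
            value *= x
        dist[value] += prob
    return dict(dist)


dist2 = product_distribution(2)
for value in sorted(dist2):
    print(f"Y2 = sqrt({value}): probability {dist2[value]}")
```

Brute-force enumeration of this kind visits $3^n$ samples, which is why it is practical for n = 2 but not for much larger n.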
Now suppose instead we have a sample $X_1, \ldots, X_{20}$ of size n = 20, and we want to
find the distribution of $Y_{20} = (X_1 \cdots X_{20})^{1/20}$. Obviously, we can proceed as above,
but this time the computations are much more complicated, as there are now $3^{20} =
3{,}486{,}784{,}401$ possible samples, as opposed to the $3^2 = 9$ samples used to form the
previous table. Directly computing $p_{Y_{20}}$, as we have done for $p_{Y_2}$, would be onerous,
even for a computer! So what can we do here?

One possibility is to look at the distribution of $Y_n = (X_1 \cdots X_n)^{1/n}$ when n is large
and see if we can approximate this in some fashion. The results of Section 4.4.1 show
that

$$\ln Y_n = \frac{1}{n}\sum_{i=1}^{n} \ln X_i$$

has an approximate normal distribution when n is large. In fact, the approximating normal
distribution when n = 20 turns out to be an $N(0.447940, 0.105167^2)$ distribution, i.e.,
one with mean 0.447940 and standard deviation 0.105167.
We have plotted this density in Figure 4.1.1.

Another approach is to use the methods of Section 2.10 to generate N samples of
size n = 20 from $p_X$, calculate $\ln Y_{20}$ for each (ln is a 1-1 transformation, and we
transform to avoid the potentially large values assumed by $Y_{20}$), and then use these
N values to approximate the distribution of $\ln Y_{20}$. For example, in Figure 4.1.2 we
have provided a plot of a density histogram (see Section 5.4.3 for more discussion of
histograms) of $N = 10^4$ values of $\ln Y_{20}$ calculated from $N = 10^4$ samples of size n =
20 generated (using the computer) from $p_X$. The area of each rectangle corresponds
to the proportion of values of $\ln Y_{20}$ that were in the interval given by the base of the
rectangle. As we will see in Sections 4.2, 4.3, and 4.4, these areas approximate the
actual probabilities that $\ln Y_{20}$ falls in these intervals. These approximations improve
as we increase N.

Notice the similarity in the shapes of Figures 4.1.1 and 4.1.2. Figure 4.1.2 is not
symmetrical about its center, however, as it is somewhat skewed. This is an indication
that the normal approximation is not entirely adequate when n = 20. ∎
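
Both approaches in this example can be sketched in a few lines. The code below first computes the mean and standard deviation of the approximating normal distribution for $\ln Y_{20}$ from the mean and variance of $\ln X_i$, and then generates $N = 10^4$ simulated values of $\ln Y_{20}$. It is an illustrative sketch only: the seed is arbitrary, and `random.choices` stands in for the generation methods of Section 2.10.

```python
import math
import random

values, probs = [1, 2, 3], [0.5, 0.25, 0.25]  # p_X from Example 4.1.1
n = 20

# Mean and variance of ln X_i, then the normal approximation for ln Y_n.
mean_ln = sum(p * math.log(x) for x, p in zip(values, probs))
var_ln = sum(p * math.log(x) ** 2 for x, p in zip(values, probs)) - mean_ln ** 2
clt_mean = mean_ln
clt_sd = math.sqrt(var_ln / n)
print(round(clt_mean, 6), round(clt_sd, 6))  # 0.44794 0.105167

# Monte Carlo: N samples of size n, recording ln Y_n for each.
rng = random.Random(1)  # arbitrary seed, for reproducibility
N = 10_000
draws = []
for _ in range(N):
    sample = rng.choices(values, weights=probs, k=n)
    draws.append(sum(math.log(x) for x in sample) / n)

mc_mean = sum(draws) / N
mc_sd = math.sqrt(sum((d - mc_mean) ** 2 for d in draws) / (N - 1))
print(mc_mean, mc_sd)  # should be close to clt_mean and clt_sd
```

The simulated mean and standard deviation should agree with the normal approximation up to Monte Carlo error; a histogram of `draws` would be needed to see the skewness discussed above.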


Figure 4.1.1: Plot of the approximating $N(0.447940, 0.105167^2)$ density to the distribution of
$\ln Y_{20}$ in Example 4.1.1 (horizontal axis: ln Y; vertical axis: density).


Figure 4.1.2: Plot of $N = 10^4$ values of $\ln Y_{20}$ obtained by generating $N = 10^4$ samples from
$p_X$ in Example 4.1.1 (horizontal axis: ln Y; vertical axis: density).


Sometimes we are lucky and can work out the sampling distribution of

$$Y = h(X_1, X_2, \ldots, X_n)$$

exactly in a form useful for computing probabilities and expectations for Y. In general,
however, when we want to compute $P(Y \in B) = P_Y(B)$, we will have to determine
the set of samples $(X_1, X_2, \ldots, X_n)$ such that $Y \in B$, as given by

$$h^{-1}B = \{(x_1, x_2, \ldots, x_n) : h(x_1, x_2, \ldots, x_n) \in B\},$$

and then compute $P((X_1, X_2, \ldots, X_n) \in h^{-1}B)$. This is typically an intractable problem
and approximations or simulation (Monte Carlo) methods will be essential. Techniques
for deriving such approximations will be discussed in subsequent sections of
this chapter. In particular, we will develop an important approximation to the sampling
distribution of the sample mean

$$\bar{X} = h(X_1, X_2, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i.$$


Summary of Section 4.1 


• A sampling distribution is the distribution of a random variable corresponding to
a function of some i.i.d. sequence.

• Sampling distributions can sometimes be computed by direct computation or by
approximations such as the central limit theorem.




EXERCISES 


4.1.1 Suppose that $X_1, X_2, X_3$ are i.i.d. from $p_X$ in Example 4.1.1. Determine the
exact distribution of $Y_3 = (X_1 X_2 X_3)^{1/3}$.

4.1.2 Suppose that a fair six-sided die is tossed n = 2 independent times. Compute 
the exact distribution of the sample mean. 

4.1.3 Suppose that an urn contains a proportion p of chips labelled 0 and proportion 
1 — p of chips labelled 1. For a sample of n = 2, drawn with replacement, determine 
the distribution of the sample mean. 

4.1.4 Suppose that an urn contains N chips labelled 0 and M chips labelled 1. For a
sample of n = 2, drawn without replacement, determine the distribution of the sample
mean.

4.1.5 Suppose that a symmetrical die is tossed n = 20 independent times. Work out 
the exact sampling distribution of the maximum of this sample. 

4.1.6 Suppose three fair dice are rolled, and let Y be the number of 6’s showing. Com- 
pute the exact distribution of Y. 

4.1.7 Suppose two fair dice are rolled, and let W be the product of the two numbers 
showing. Compute the exact distribution of W. 

4.1.8 Suppose two fair dice are rolled, and let Z be the difference of the two numbers 
showing (i.e., the first number minus the second number). Compute the exact distribu- 
tion of Z. 

4.1.9 Suppose four fair coins are flipped, and let Y be the number of pairs of coins 
which land the same way (i.e., the number of pairs that are either both heads or both 
tails). Compute the exact distribution of Y. 


COMPUTER EXERCISES 


4.1.10 Generate a sample of N = 10° values of Yso in Example 4.1.1. Calculate the 
mean and standard deviation of this sample. 

4.1.11 Suppose that $X_1, X_2, \ldots, X_{10}$ is an i.i.d. sequence from an N(0, 1) distribution.
Generate a sample of $N = 10^3$ values from the distribution of $\max(X_1, X_2, \ldots,
X_{10})$. Calculate the mean and standard deviation of this sample.


PROBLEMS 


4.1.12 Suppose that $X_1, X_2, \ldots, X_n$ is a sample from the Poisson($\lambda$) distribution.
Determine the exact sampling distribution of $Y = X_1 + X_2 + \cdots + X_n$. (Hint: Determine
the moment-generating function of Y and use the uniqueness theorem.)

4.1.13 Suppose that $X_1, X_2$ is a sample from the Uniform[0, 1] distribution. Determine
the exact sampling distribution of $Y = X_1 + X_2$. (Hint: Determine the density of Y.)

4.1.14 Suppose that $X_1, X_2$ is a sample from the Uniform[0, 1] distribution. Determine
the exact sampling distribution of $Y = (X_1 X_2)^{1/2}$. (Hint: Determine the density of
ln Y and then transform.)




4.2 | Convergence in Probability 


Notions of convergence are fundamental to much of mathematics. For example, if
$a_n = 1 - 1/n$, then $a_1 = 0$, $a_2 = 1/2$, $a_3 = 2/3$, $a_4 = 3/4$, etc. We see that the
values of $a_n$ are getting “closer and closer” to 1, and indeed we know from calculus
that $\lim_{n\to\infty} a_n = 1$ in this case.

For random variables, notions of convergence are more complicated. If the values 
themselves are random, then how can they “converge” to anything? On the other hand, 
we can consider various probabilities associated with the random variables and see if 
they converge in some sense. 

The simplest notion of convergence of random variables is convergence in prob- 
ability, as follows. (Other notions of convergence will be developed in subsequent 
sections.) 


Definition 4.2.1 Let $X_1, X_2, \ldots$ be an infinite sequence of random variables, and
let Y be another random variable. Then the sequence $\{X_n\}$ converges in probability
to Y, if for all $\epsilon > 0$, $\lim_{n\to\infty} P(|X_n - Y| \ge \epsilon) = 0$, and we write $X_n \xrightarrow{P} Y$.


In Figure 4.2.1, we have plotted the differences $X_n - Y$, for selected values of n,
for 10 generated sequences $\{X_n - Y\}$ for a typical situation where the random variables
$X_n$ converge to a random variable Y in probability. We have also plotted the horizontal
lines at $\pm\epsilon$ for $\epsilon = 0.25$. From this we can see the increasing concentration of the
distribution of $X_n - Y$ about 0, as n increases, as required by Definition 4.2.1. In fact,
the 10 observed values of $X_{100} - Y$ all satisfy the inequality $|X_{100} - Y| < 0.25$.


Figure 4.2.1: Plot of 10 replications of $\{X_n - Y\}$ illustrating the convergence in probability of
$X_n$ to Y (horizontal axis: n).


We consider some applications of this definition. 




EXAMPLE 4.2.1 
Let Y be any random variable, and let $X_1 = X_2 = X_3 = \cdots = Y$. (That is, the random
variables are all identical to each other.) In that case, $|X_n - Y| = 0$, so of course

$$\lim_{n\to\infty} P(|X_n - Y| \ge \epsilon) = 0$$

for all $\epsilon > 0$. Hence, $X_n \xrightarrow{P} Y$. ∎

EXAMPLE 4.2.2 

Suppose $P(X_n = 1 - 1/n) = 1$ and $P(Y = 1) = 1$. Then $P(|X_n - Y| \ge \epsilon) = 0$
whenever $n > 1/\epsilon$. Hence, $P(|X_n - Y| \ge \epsilon) \to 0$ as $n \to \infty$ for all $\epsilon > 0$. Hence,
the sequence $\{X_n\}$ converges in probability to Y. (Here, the distributions of $X_n$ and Y
are all degenerate.) ∎


EXAMPLE 4.2.3 
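Let U ~ Uniform[0, 1]. Define $X_n$ by

$$
X_n = \begin{cases} 3 & U \le \frac{2}{3} - \frac{1}{n} \\ 8 & \text{otherwise,} \end{cases}
$$

and define Y by

$$
Y = \begin{cases} 3 & U \le \frac{2}{3} \\ 8 & \text{otherwise.} \end{cases}
$$

Then

$$P(|X_n - Y| \ge \epsilon) \le P(X_n \ne Y) = P\Big(\frac{2}{3} - \frac{1}{n} < U \le \frac{2}{3}\Big).$$

This last probability equals 1/n for $n \ge 2$.
Hence, $P(|X_n - Y| \ge \epsilon) \to 0$ as $n \to \infty$ for all $\epsilon > 0$, and the sequence $\{X_n\}$
converges in probability to Y. (This time, the distributions of $X_n$ and Y are not degenerate.)
∎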


A common case is where the distributions of the X, are not degenerate, but Y is 
just a constant, as in the following example. 


EXAMPLE 4.2.4 
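Suppose $Z_n$ ~ Exponential(n) and let Y = 0. Then

$$P(|Z_n - Y| \ge \epsilon) = P(Z_n \ge \epsilon) = \int_{\epsilon}^{\infty} n e^{-nx}\,dx = e^{-n\epsilon}.$$

Hence, again $P(|Z_n - Y| \ge \epsilon) \to 0$ as $n \to \infty$ for all $\epsilon > 0$, so the sequence $\{Z_n\}$
converges in probability to Y. ∎
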
4.2.1 | The Weak Law of Large Numbers 


One of the most important applications of convergence in probability is the weak law
of large numbers. Suppose $X_1, X_2, \ldots$ is a sequence of independent random variables
that each have the same mean $\mu$. For large n, what can we say about their average

$$M_n = \frac{1}{n}(X_1 + \cdots + X_n)?$$



We refer to $M_n$ as the sample average, or sample mean, for $X_1, \ldots, X_n$. When the
sample size n is fixed, we will often use $\bar{X}$ as a notation for sample mean instead of
$M_n$.

For example, if we flip a sequence of fair coins, and if $X_i = 1$ or $X_i = 0$ as the ith
coin comes up heads or tails, then $M_n$ represents the fraction of the first n coins that
came up heads. We might expect that for large n, this fraction will be close to 1/2, i.e.,
to the expected value of the $X_i$.

The weak law of large numbers provides a precise sense in which average values
$M_n$ tend to get close to $E(X_i)$, for large n.

Theorem 4.2.1 (Weak law of large numbers) Let $X_1, X_2, \ldots$ be a sequence of independent
random variables, each having the same mean $\mu$ and each having variance
less than or equal to $v < \infty$. Then for all $\epsilon > 0$, $\lim_{n\to\infty} P(|M_n - \mu| \ge \epsilon) = 0$.
That is, the averages converge in probability to the common mean $\mu$, or $M_n \xrightarrow{P} \mu$.


PROOF Using linearity of expected value, we see that $E(M_n) = \mu$. Also, using
independence, we have

$$
\begin{aligned}
\mathrm{Var}(M_n) &= \frac{1}{n^2}\big(\mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots + \mathrm{Var}(X_n)\big) \\
&\le \frac{1}{n^2}(v + v + \cdots + v) = \frac{1}{n^2}(nv) = v/n.
\end{aligned}
$$

Hence, by Chebychev’s inequality (Theorem 3.6.2), we have

$$P(|M_n - \mu| \ge \epsilon) \le \mathrm{Var}(M_n)/\epsilon^2 \le v/(\epsilon^2 n).$$

This converges to 0 as $n \to \infty$, which proves the theorem. ∎


It is a fact that, in Theorem 4.2.1, if we require the $X_i$ variables to be i.i.d. instead
of merely independent, then we do not even need the $X_i$ to have finite variance. But we
will not discuss this result further here. Consider some applications of the weak law of
large numbers.


EXAMPLE 4.2.5 

Consider flipping a sequence of identical fair coins. Let $M_n$ be the fraction of the first
n coins that are heads. Then $M_n = (X_1 + \cdots + X_n)/n$, where $X_i = 1$ if the ith coin
is heads, otherwise $X_i = 0$. Hence, by the weak law of large numbers, we have

$$
\begin{aligned}
\lim_{n\to\infty} P(M_n \le 0.49) &= \lim_{n\to\infty} P(M_n - 0.5 \le -0.01) \\
&\le \lim_{n\to\infty} P(M_n - 0.5 \le -0.01 \text{ or } M_n - 0.5 \ge 0.01) \\
&= \lim_{n\to\infty} P(|M_n - 0.5| \ge 0.01) = 0
\end{aligned}
$$

and, similarly, $\lim_{n\to\infty} P(M_n \ge 0.51) = 0$. This illustrates that for large n, it is very
likely that $M_n$ is very close to 0.5. ∎
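
For fair coins the probability $P(|M_n - 0.5| \ge 0.01)$ can be computed directly from the Binomial(n, 1/2) distribution, which shows how slowly it decreases and how it compares with the bound $v/(\epsilon^2 n)$ from the proof of Theorem 4.2.1 (here v = 1/4). The sketch below is illustrative; the function name `tail_probability` and the chosen values of n are assumptions.

```python
import math
from fractions import Fraction


def tail_probability(n, eps):
    """P(|M_n - 1/2| >= eps) for n fair coin flips, summed from the binomial pmf."""
    log_half_n = -n * math.log(2.0)
    total = 0.0
    for k in range(n + 1):
        # Exact rational comparison avoids rounding issues at the boundary |k/n - 1/2| = eps.
        if abs(Fraction(k, n) - Fraction(1, 2)) >= eps:
            # log of C(n, k) / 2^n, computed with log-gamma to avoid overflow
            log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                       - math.lgamma(n - k + 1) + log_half_n)
            total += math.exp(log_pmf)
    return total


eps = Fraction(1, 100)
for n in (100, 1_000, 10_000):
    chebychev_bound = 0.25 / (n * float(eps) ** 2)  # v / (eps^2 n) with v = 1/4
    print(n, tail_probability(n, eps), min(1.0, chebychev_bound))
```

The computed probabilities decrease toward 0 as n grows, and for n = 10,000 the probability is well below the Chebychev bound of 0.25, which illustrates that the bound is sufficient for the proof but far from sharp.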




EXAMPLE 4.2.6 

Consider flipping a sequence of identical coins, each of which has probability p of
coming up heads. Let $M_n$ again be the fraction of the first n coins that are heads. Then
by the weak law of large numbers, for any $\epsilon > 0$, $\lim_{n\to\infty} P(p - \epsilon < M_n < p + \epsilon) = 1$.
We thus see that for large n, it is very likely that $M_n$ is very close to p. (The previous
example corresponds to the special case p = 1/2.) ∎

EXAMPLE 4.2.7

Let $X_1, X_2, \ldots$ be i.i.d. with distribution N(3, 5). Then $E(M_n) = 3$, and by the weak
law of large numbers, $P(3 - \epsilon < M_n < 3 + \epsilon) \to 1$ as $n \to \infty$. Hence, for large n,
the average value $M_n$ is very likely to be very close to 3. ∎

EXAMPLE 4.2.8

Let $W_1, W_2, \ldots$ be i.i.d. with distribution Exponential(6). Then $E(M_n) = 1/6$, and by
the weak law of large numbers, $P(1/6 - \epsilon < M_n < 1/6 + \epsilon) \to 1$ as $n \to \infty$.
Hence, for large n, the average value $M_n$ is very likely to be very close to 1/6. ∎


Summary of Section 4.2 


• A sequence $\{X_n\}$ of random variables converges in probability to Y if
$\lim_{n\to\infty} P(|X_n - Y| \ge \epsilon) = 0$ for all $\epsilon > 0$.

• The weak law of large numbers says that if $\{X_n\}$ is i.i.d. (or is independent with
constant mean and bounded variance), then the averages $M_n = (X_1 + \cdots +
X_n)/n$ converge in probability to $E(X_i)$.


EXERCISES 


4.2.1 Let U ~ Uniform[5, 10], and let $Z = I_{[5,7)}(U)$ and $Z_n = I_{[5,\,7+1/n^2)}(U)$. Prove
that $Z_n \to Z$ in probability.

4.2.2 Let Y ~ Uniform[0, 1], and let $X_n = Y^n$. Prove that $X_n \to 0$ in probability.

4.2.3 Let $W_1, W_2, \ldots$ be i.i.d. with distribution Exponential(3). Prove that for some n,
we have $P(W_1 + W_2 + \cdots + W_n < n/2) > 0.999$.

4.2.4 Let $Y_1, Y_2, \ldots$ be i.i.d. with distribution N(2, 5). Prove that for some n, we have
$P(Y_1 + Y_2 + \cdots + Y_n > n) > 0.999$.

4.2.5 Let $X_1, X_2, \ldots$ be i.i.d. with distribution Poisson(8). Prove that for some n, we
have $P(X_1 + X_2 + \cdots + X_n > 9n) < 0.001$.

4.2.6 Suppose X ~ Uniform[0, 1], and let Y, = zy, Prove that Y,, 5 X. 

4.2.7 Let $H_n$ be the number of heads when flipping n fair coins, let $X_n = e^{-H_n}$, and
let Y = 0. Prove that $X_n \xrightarrow{P} Y$.

4.2.8 Let $Z_n$ ~ Uniform[0, n], let $W_n = 5Z_n/(Z_n + 1)$, and let W = 5. Prove that
$W_n \xrightarrow{P} W$.

4.2.9 Consider flipping n fair coins. Let $H_n$ be the total number of heads, and let $F_n$
be the number of heads on coins 1 through n − 1 (i.e., omitting the nth coin). Let
$X_n = H_n/(H_n + 1)$, and $Y_n = F_n/(H_n + 1)$, and Z = 0. Prove that $X_n - Y_n \xrightarrow{P} Z$.




4.2.10 Let $Z_n$ be the sum of the squares of the numbers showing when we roll n fair
dice. Find (with proof) a number m such that $\frac{1}{n} Z_n \xrightarrow{P} m$. (Hint: Use the weak law of
large numbers.)

4.2.11 Consider flipping n fair nickels and n fair dimes. Let $X_n$ equal 4 times the
number of nickels showing heads, plus 5 times the number of dimes showing heads.
Find (with proof) a number r such that $\frac{1}{n} X_n \xrightarrow{P} r$.


COMPUTER EXERCISES 


4.2.12 Generate i.i.d. X1, ..., Xn distributed Exponential(5) and compute M, when 
n = 20. Repeat this N times, where N is large (if possible, take N = 10°, otherwise 
as large as is feasible), and compute the proportion of values of M, that lie between 
0.19 and 0.21. Repeat this with n = 50. What property of convergence in probability 
do your results illustrate? 

4.2.13 Generate iid. X1, ..., Xn distributed Poisson(7) and compute M, when n = 
20. Repeat this N times, where N is large (if possible, take N = 10°, otherwise as 
large as is feasible), and compute the proportion of values of M, that lie between 6.99 
and 7.01. Repeat this with n = 100. What property of convergence in probability do 
your results illustrate? 


PROBLEMS 


4.2.14 Give an example of random variables $X_1, X_2, \ldots$ such that $\{X_n\}$ converges to
0 in probability, but $E(X_n) = 1$ for all n. (Hint: Suppose $P(X_n = n) = 1/n$ and
$P(X_n = 0) = 1 - 1/n$.)

4.2.15 Prove that $X_n \xrightarrow{P} 0$ if and only if $|X_n| \xrightarrow{P} 0$.

4.2.16 Prove or disprove that $X_n \xrightarrow{P} 5$ if and only if $|X_n| \xrightarrow{P} 5$.

4.2.17 Suppose $X_n \xrightarrow{P} X$, and $Y_n \xrightarrow{P} Y$. Let $Z_n = X_n + Y_n$ and $Z = X + Y$. Prove
that $Z_n \xrightarrow{P} Z$.


CHALLENGES


4.2.18 Suppose $X_n \xrightarrow{P} X$, and f is a continuous function. Prove that $f(X_n) \xrightarrow{P} f(X)$.


4.3 | Convergence with Probability 1 


A notion of convergence for random variables that is closely associated with the con- 
vergence of a sequence of real numbers is provided by the concept of convergence with 
probability 1. This property is given in the following definition. 


Definition 4.3.1 Let $X_1, X_2, \ldots$ be an infinite sequence of random variables. We
shall say that the sequence $\{X_n\}$ converges with probability 1 (or converges almost
surely (a.s.)) to a random variable Y, if $P(\lim_{n\to\infty} X_n = Y) = 1$ and we write
$X_n \xrightarrow{a.s.} Y$.




In Figure 4.3.1, we illustrate this convergence by graphing the sequence of differences
$\{X_n - Y\}$ for a typical situation where the random variables $X_n$ converge to a
random variable Y with probability 1. We have also plotted the horizontal lines at $\pm\epsilon$
for $\epsilon = 0.1$. Notice that inevitably all the values $X_n - Y$ are in the interval (−0.1, 0.1)
or, in other words, the values of $X_n$ are within 0.1 of the values of Y.

Definition 4.3.1 indicates that for any given $\epsilon > 0$, there will exist a value $N_\epsilon$
such that $|X_n - Y| < \epsilon$ for every $n > N_\epsilon$. The value of $N_\epsilon$ will vary depending on
the observed value of the sequence $\{X_n - Y\}$, but it always exists. Contrast this with
the situation depicted in Figure 4.2.1, which only says that the probability distribution of
$X_n - Y$ concentrates about 0 as n grows and not that the individual values of $X_n - Y$
will necessarily all be near 0 (also see Example 4.3.2).


Figure 4.3.1: Plot of a single replication $\{X_n - Y\}$ illustrating the convergence with probability
1 of $X_n$ to Y (horizontal axis: n; vertical axis: differences).


Consider an example of this. 


EXAMPLE 4.3.1 
Consider again the setup of Example 4.2.3, where U ~ Uniform[0, 1],

$$
X_n = \begin{cases} 3 & U \le \frac{2}{3} - \frac{1}{n} \\ 8 & \text{otherwise} \end{cases}
\qquad\text{and}\qquad
Y = \begin{cases} 3 & U \le \frac{2}{3} \\ 8 & \text{otherwise.} \end{cases}
$$

If U > 2/3, then Y = 8 and also $X_n = 8$ for all n, so clearly $X_n \to Y$. If U < 2/3,
then for large enough n we will also have

$$U \le \frac{2}{3} - \frac{1}{n},$$

so again $X_n \to Y$. On the other hand, if U = 2/3, then we will always have $X_n = 8$,
even though Y = 3. Hence, $X_n \to Y$ except when U = 2/3. Because P(U = 2/3) =
0, we do have $X_n \to Y$ with probability 1. ∎




One might wonder what the relationship is between convergence in probability and 
convergence with probability 1. The following theorem provides an answer. 


Theorem 4.3.1 Let $Z, Z_1, Z_2, \ldots$ be random variables. Suppose $Z_n \to Z$ with
probability 1. Then $Z_n \to Z$ in probability. That is, if a sequence of random
variables converges almost surely, then it converges in probability to the same limit.


PROOF See Section 4.7 for the proof of this result. ∎


On the other hand, the converse to Theorem 4.3.1 is false, as the following example 
shows. 


EXAMPLE 4.3.2 
Let U have the uniform distribution on [0, 1]. We construct an infinite sequence of
random variables $\{X_n\}$ by setting

$$
\begin{aligned}
&X_1 = I_{[0,1/2)}(U),\quad X_2 = I_{[1/2,1]}(U), \\
&X_3 = I_{[0,1/4)}(U),\quad X_4 = I_{[1/4,1/2)}(U),\quad X_5 = I_{[1/2,3/4)}(U),\quad X_6 = I_{[3/4,1]}(U), \\
&X_7 = I_{[0,1/8)}(U),\quad X_8 = I_{[1/8,1/4)}(U),\ \ldots
\end{aligned}
$$

where $I_A$ is the indicator function of the event A, i.e., $I_A(s) = 1$ if $s \in A$, and $I_A(s) =
0$ if $s \notin A$.

Note that we first subdivided [0, 1] into two equal-length subintervals and defined
$X_1$ and $X_2$ as the indicator functions for the two subintervals. Next we subdivided [0, 1]
into four equal-length subintervals and defined $X_3$, $X_4$, $X_5$, and $X_6$ as the indicator
functions for the four subintervals. We continued this process by next dividing [0, 1]
into eight equal-length subintervals, then 16 equal-length subintervals, etc., to obtain
an infinite sequence of random variables.

Each of these random variables $X_n$ takes the values 0 and 1 only and so must follow
a Bernoulli distribution. In particular, $X_1$ ~ Bernoulli(1/2), $X_2$ ~ Bernoulli(1/2), $X_3$
~ Bernoulli(1/4), etc.

Then for $0 < \epsilon < 1$, we have that $P(|X_n - 0| \ge \epsilon) = P(X_n = 1)$. Because
the intervals for U that make $X_n \ne 0$ are getting smaller and smaller, we see that
$P(X_n = 1)$ is converging to 0. Hence, $X_n$ converges to 0 in probability.

On the other hand, $X_n$ does not converge to 0 almost surely. Indeed, no matter what
value U takes on, there will always be infinitely many different n for which $X_n = 1$.
Hence, we will have $X_n = 1$ infinitely often, so that we will not have $X_n$ converging
to 0 for any particular value of U. Thus, $P(\lim_{n\to\infty} X_n = 0) = 0$, and $X_n$ does not
converge to 0 with probability 1. ∎


Theorem 4.3.1 and Example 4.3.2 together show that convergence with probability 1 is 
a stronger notion than convergence in probability. 

Now, the weak law of large numbers (Section 4.2.1) concludes only that the averages
$M_n$ are converging in probability to $E(X_i)$. A stronger version of this result
would instead conclude convergence with probability 1. We consider that now.




4.3.1| The Strong Law of Large Numbers 


The following is a strengthening of the weak law of large numbers because it concludes 
convergence with probability 1 instead of just convergence in probability. 


Theorem 4.3.2 (Strong law of large numbers) Let $X_1, X_2, \ldots$ be a sequence of
i.i.d. random variables, each having finite mean $\mu$. Then

$$P\Big(\lim_{n\to\infty} M_n = \mu\Big) = 1.$$

That is, the averages converge with probability 1 to the common mean $\mu$, or
$M_n \xrightarrow{a.s.} \mu$.


PROOF See A First Look at Rigorous Probability Theory, Second Edition, by J. S.
Rosenthal (World Scientific Publishing Co., 2006) for a proof of this result. ∎


This result says that sample averages converge with probability 1 to $\mu$.

Like Theorem 4.2.1, it says that for large n the averages $M_n$ are usually close to
$\mu = E(X_i)$. But it says in addition that if we wait long enough (i.e., if n
is large enough), then eventually the averages will all be close to $\mu$, for all sufficiently
large n. In other words, the sample mean is consistent for $\mu$.
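
The statement that the averages eventually stay close to $\mu$ refers to a single realized sequence, which a running average makes visible. The sketch below follows one simulated path of $M_1, M_2, \ldots$ for i.i.d. N(3, 5) variables (the distribution of Example 4.2.7, with variance 5); the seed, the path length, and the cutoff 10,000 are arbitrary illustrative choices.

```python
import math
import random

rng = random.Random(2)             # arbitrary seed, for reproducibility
mu, sigma = 3.0, math.sqrt(5.0)    # N(3, 5): mean 3, variance 5
n_max = 100_000

running_sum = 0.0
means = []                         # means[n - 1] holds M_n
for n in range(1, n_max + 1):
    running_sum += rng.gauss(mu, sigma)
    means.append(running_sum / n)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, means[n - 1])

# Largest deviation from mu over the whole tail of the path, n > 10,000.
late_dev = max(abs(m - mu) for m in means[10_000:])
print(late_dev)
```

The last quantity looks at every $M_n$ beyond n = 10,000 along the same path, which is the kind of simultaneous closeness the strong law describes; a single simulation only illustrates the theorem and does not prove it.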


Summary of Section 4.3 


• A sequence $\{X_n\}$ of random variables converges with probability 1 (or converges
almost surely) to Y if $P(\lim_{n\to\infty} X_n = Y) = 1$.

• Convergence with probability 1 implies convergence in probability.

• The strong law of large numbers says that if $\{X_n\}$ is i.i.d., then the averages
$M_n = (X_1 + \cdots + X_n)/n$ converge with probability 1 to $E(X_i)$.


EXERCISES 


4.3.1 Let U ~ Uniform[5, 10], and let $Z = I_{[5,7)}(U)$ (i.e., Z is the indicator function
of [5, 7)) and $Z_n = I_{[5,\,7+1/n^2)}(U)$. Prove that $Z_n \to Z$ with probability 1.

4.3.2 Let Y ~ Uniform[0, 1], and let $X_n = Y^n$. Prove that $X_n \to 0$ with probability
1.

4.3.3 Let $W_1, W_2, \ldots$ be i.i.d. with distribution Exponential(3). Prove that with
probability 1, for some n, we have $W_1 + W_2 + \cdots + W_n < n/2$.

4.3.4 Let $Y_1, Y_2, \ldots$ be i.i.d. with distribution N(2, 5). Prove that with probability 1,
for some n, we have $Y_1 + Y_2 + \cdots + Y_n > n$.

4.3.5 Suppose $X_n \to X$ with probability 1, and also $Y_n \to Y$ with probability 1. Prove
that $P(X_n \to X \text{ and } Y_n \to Y) = 1$.

4.3.6 Suppose $Z_1, Z_2, \ldots$ are i.i.d. with finite mean $\mu$. Let $M_n = (Z_1 + \cdots + Z_n)/n$.
Determine (with explanation) whether the following statements are true or false.




(a) With probability 1, $M_n = \mu$ for some n.

(b) With probability 1, $\mu - 0.01 < M_n < \mu + 0.01$ for some n.

(c) With probability 1, $\mu - 0.01 < M_n < \mu + 0.01$ for all but finitely many n.

(d) For any $x \in R^1$, with probability 1, $x - 0.01 < M_n < x + 0.01$ for some n.

4.3.7 Let $\{X_n\}$ be i.i.d., with $X_n$ ~ Uniform[3, 7]. Let $Y_n = (X_1 + X_2 + \cdots + X_n)/n$.
Find (with proof) a number m such that $Y_n \xrightarrow{a.s.} m$. (Hint: Use the strong law of large
numbers.)

4.3.8 Let $Z_n$ be the sum of the squares of the numbers showing when we roll n fair
dice. Find (with proof) a number m such that $\frac{1}{n} Z_n \xrightarrow{a.s.} m$.

4.3.9 Consider flipping n fair nickels and n fair dimes. Let $X_n$ equal 4 times the
number of nickels showing heads, plus 5 times the number of dimes showing heads.
Find (with proof) a number r such that $\frac{1}{n} X_n \xrightarrow{a.s.} r$.

4.3.10 Suppose $Y_n \xrightarrow{a.s.} Y$. Does this imply that $P(|Y_5 - Y| > |Y_4 - Y|) = 0$? Explain.

4.3.11 Consider repeatedly flipping a fair coin. Let $H_n$ be the number of heads on the
first n flips, and let $Z_n = H_n/n$.

(a) Prove that there is some m such that $|Z_n - 1/2| < 0.001$ for all n > m.

(b) Let r be the smallest positive integer satisfying $|Z_r - 1/2| < 0.001$. Must we have
$|Z_n - 1/2| < 0.001$ for all n > r? Why or why not?

4.3.12 Suppose P(X = 0) = P(X = 1) = 1/2, and let $X_n = X$ for n = 1, 2, 3, ....
(That is, the random variables $X_n$ are all identical.) Let $Y_n = (X_1 + X_2 + \cdots + X_n)/n$.

(a) Prove that $P(\lim_{n\to\infty} Y_n = 0) = P(\lim_{n\to\infty} Y_n = 1) = 1/2$.

(b) Prove that there is no number m such that $P(\lim_{n\to\infty} Y_n = m) = 1$.


(c) Why does part (b) not contradict the law of large numbers? 


COMPUTER EXERCISES 


4.3.13 Generate i.i.d. $X_1, \ldots, X_n$ distributed Exponential(5) with $n$ large (take $n = 10^5$
if possible). Plot the values $M_1, M_2, \ldots, M_n$. To what value are they converging?
How quickly?

4.3.14 Generate i.i.d. $X_1, \ldots, X_n$ distributed Poisson(7) with $n$ large (take $n = 10^5$ if
possible). Plot the values $M_1, M_2, \ldots, M_n$. To what value are they converging? How
quickly?

4.3.15 Generate i.i.d. $X_1, X_2, \ldots, X_n$ distributed $N(-4, 3)$ with $n$ large (take $n = 10^5$
if possible). Plot the values $M_1, M_2, \ldots, M_n$. To what value are they converging? How
quickly?


PROBLEMS 


4.3.16 Suppose for each positive integer $k$, there are random variables $W_k, X_{k,1}, X_{k,2}, \ldots$
such that $P(\lim_{n\to\infty} X_{k,n} = W_k) = 1$. Prove that
$P(\lim_{n\to\infty} X_{k,n} = W_k \text{ for all } k) = 1$.

4.3.17 Prove that $X_n \xrightarrow{a.s.} 0$ if and only if $|X_n| \xrightarrow{a.s.} 0$.

4.3.18 Prove or disprove that $X_n \xrightarrow{a.s.} 5$ if and only if $|X_n| \xrightarrow{a.s.} 5$.




4.3.19 Suppose $X_n \xrightarrow{a.s.} X$ and $Y_n \xrightarrow{a.s.} Y$. Let $Z_n = X_n + Y_n$ and $Z = X + Y$. Prove
that $Z_n \xrightarrow{a.s.} Z$.


CHALLENGES 


4.3.20 Suppose for each real number $r \in [0, 1]$, there are random variables $W_r, X_{r,1},
X_{r,2}, \ldots$ such that $P(\lim_{n\to\infty} X_{r,n} = W_r) = 1$. Prove or disprove that we must have
$P(\lim_{n\to\infty} X_{r,n} = W_r \text{ for all } r \in [0, 1]) = 1$.

4.3.21 Give an example of random variables $X_1, X_2, \ldots$ such that $\{X_n\}$ converges to
0 with probability 1, but $E(X_n) = 1$ for all $n$.

4.3.22 Suppose $X_n \xrightarrow{a.s.} X$, and $f$ is a continuous function. Prove that $f(X_n) \xrightarrow{a.s.} f(X)$.


4.4 Convergence in Distribution


There is yet another notion of convergence of a sequence of random variables that is 
important in applications of probability and statistics. 


Definition 4.4.1 Let $X, X_1, X_2, \ldots$ be random variables. Then we say that the
sequence $\{X_n\}$ converges in distribution to $X$ if for all $x \in R^1$ such that
$P(X = x) = 0$ we have $\lim_{n\to\infty} P(X_n \le x) = P(X \le x)$, and we write $X_n \xrightarrow{D} X$.


Intuitively, $\{X_n\}$ converges in distribution to $X$ if for large $n$, the distribution of $X_n$
is close to that of $X$. The importance of this, as we will see, is that often the distribution
of $X_n$ is difficult to work with, while that of $X$ is much simpler. With $X_n$ converging
in distribution to $X$, however, we can approximate the distribution of $X_n$ by that of $X$.


EXAMPLE 4.4.1 
Suppose $P(X_n = 1) = 1/n$ and $P(X_n = 0) = 1 - 1/n$. Let $X = 0$, so that
$P(X = 0) = 1$. Then

$$P(X_n \le x) = \begin{cases} 0 & x < 0 \\ 1 - 1/n & 0 \le x < 1 \\ 1 & 1 \le x \end{cases}
\;\longrightarrow\;
P(X \le x) = \begin{cases} 0 & x < 0 \\ 1 & 0 \le x \end{cases}$$

as $n \to \infty$. As $P(X_n \le x) \to P(X \le x)$ for every $x$, and in particular at all $x$
where $P(X = x) = 0$, we have that $\{X_n\}$ converges in distribution to $X$. Intuitively, as
$n \to \infty$, it is more and more likely that $X_n$ will equal 0. ■


EXAMPLE 4.4.2 

Suppose $P(X_n = 1) = 1/2 + 1/n$ and $P(X_n = 0) = 1/2 - 1/n$. Suppose further
that $P(X = 0) = P(X = 1) = 1/2$. Then $\{X_n\}$ converges in distribution to $X$ because
$P(X_n = 1) \to 1/2$ and $P(X_n = 0) \to 1/2$ as $n \to \infty$. ■


EXAMPLE 4.4.3 
Let $X \sim$ Uniform[0, 1], and let $P(X_n = i/n) = 1/n$ for $i = 1, 2, \ldots, n$. Then $X$ is
absolutely continuous, while $X_n$ is discrete. On the other hand, for any $0 \le x \le 1$, we
have $P(X \le x) = x$, and letting $\lfloor x \rfloor$ denote the greatest integer less than or equal to
$x$, we have

$$P(X_n \le x) = \frac{\lfloor nx \rfloor}{n}.$$

Hence, $|P(X_n \le x) - P(X \le x)| \le 1/n$ for all $n$. Because $\lim_{n\to\infty} 1/n = 0$, we do
indeed have $X_n \to X$ in distribution. ■
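The bound in Example 4.4.3 can be checked numerically. The following is a minimal sketch, not part of the original text; the function names and the evaluation grid are assumptions chosen for illustration.

```python
import math

def cdf_discrete(n, x):
    """P(X_n <= x) when P(X_n = i/n) = 1/n for i = 1, ..., n and 0 <= x <= 1."""
    return math.floor(n * x) / n

def max_gap(n, grid=1000):
    """Largest |P(X_n <= x) - x| over an evenly spaced grid of x in (0, 1).

    The grid size is an arbitrary illustrative choice.
    """
    xs = [k / grid for k in range(1, grid)]
    return max(abs(cdf_discrete(n, x) - x) for x in xs)

for n in (10, 100, 1000):
    # The gap should never exceed 1/n (up to floating-point rounding).
    print(n, max_gap(n), 1 / n)
```

The printed gaps shrink with $n$, in line with the $1/n$ bound used in the example.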


EXAMPLE 4.4.4 
Suppose $X_1, X_2, \ldots$ are i.i.d. with finite mean $\mu$, and $M_n = (X_1 + \cdots + X_n)/n$. Then
the weak law of large numbers says that for any $\epsilon > 0$, we have

$$P(M_n \le \mu - \epsilon) \to 0 \quad \text{and} \quad P(M_n \le \mu + \epsilon) \to 1$$

as $n \to \infty$. It follows that $\lim_{n\to\infty} P(M_n \le x) = P(M \le x)$ for any $x \ne \mu$, where
$M$ is the constant random variable $M = \mu$. Hence, $M_n \to M$ in distribution. Note that
it is not necessarily the case that $P(M_n \le \mu) \to P(M \le \mu) = 1$. However, this does
not contradict the definition of convergence in distribution because $P(M = \mu) \ne 0$,
so we do not need to worry about the case $x = \mu$. ■


EXAMPLE 4.4.5 Poisson Approximation to the Binomial 
Suppose $X_n \sim$ Binomial$(n, \lambda/n)$ and $X \sim$ Poisson$(\lambda)$. We have seen in Example
2.3.6 that

$$P(X_n = j) = \binom{n}{j} \left(\frac{\lambda}{n}\right)^{j} \left(1 - \frac{\lambda}{n}\right)^{n-j} \to e^{-\lambda} \frac{\lambda^{j}}{j!}$$

as $n \to \infty$. This implies that $F_{X_n}(x) \to F_X(x)$ at every point $x \notin \{0, 1, 2, \ldots\}$, and
these are precisely the points for which $P(X = x) = 0$. Therefore, $\{X_n\}$ converges in
distribution to $X$. (Indeed, this was our original motivation for the Poisson distribution.) ■
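To see the Poisson approximation to the binomial numerically, the sketch below compares the two probability functions for increasing $n$. It is an illustration only: the value $\lambda = 2$, the range of $j$, and the function names are assumptions, not taken from the text.

```python
import math

def binom_pmf(n, p, j):
    """P(X = j) for X ~ Binomial(n, p)."""
    return math.comb(n, j) * p**j * (1 - p) ** (n - j)

def poisson_pmf(lam, j):
    """P(X = j) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**j / math.factorial(j)

def max_pmf_gap(n, lam, j_max=10):
    """Largest |P(X_n = j) - P(X = j)| over j = 0, ..., j_max."""
    return max(
        abs(binom_pmf(n, lam / n, j) - poisson_pmf(lam, j))
        for j in range(j_max + 1)
    )

lam = 2.0  # illustrative choice of the Poisson mean
for n in (10, 100, 1000, 10000):
    print(n, max_pmf_gap(n, lam))
```

The gaps decrease toward 0 as $n$ grows, which is the pointwise convergence of probability functions used in Example 4.4.5.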


Many more examples of convergence in distribution are given by the central limit 
theorem, discussed in the next section. We first pause to consider the relationship of 
convergence in distribution to our previous notions of convergence. 


Theorem 4.4.1 If $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{D} X$.


PROOF See Section 4.7 for the proof of this result. ■


The converse to Theorem 4.4.1 is false. Indeed, the fact that $X_n$ converges in
distribution to $X$ says nothing about the underlying relationship between $X_n$ and $X$;
it says only something about their distributions. The following example illustrates this.


EXAMPLE 4.4.6 

Suppose $X, X_1, X_2, \ldots$ are i.i.d., each equal to $\pm 1$ with probability 1/2 each. In this
case, $P(X_n \le x) = P(X \le x)$ for all $n$ and for all $x \in R^1$, so of course $X_n$ converges
in distribution to $X$. On the other hand, because $X$ and $X_n$ are independent,

$$P(|X - X_n| \ge 2) = \frac{1}{2}$$

for all $n$, which does not go to 0 as $n \to \infty$. Hence, $X_n$ does not converge to $X$ in
probability (or with probability 1). So we can have convergence in distribution without
having convergence in probability or convergence with probability 1. ■


The following result, stated without proof, indicates how moment-generating func- 
tions can be used to check for convergence in distribution. (This generalizes Theo- 
rem 3.4.6.) 


Theorem 4.4.2 Let $X$ be a random variable, such that for some $s_0 > 0$, we have
$m_X(s) < \infty$ whenever $s \in (-s_0, s_0)$. If $Z_1, Z_2, \ldots$ is a sequence of random
variables with $m_{Z_n}(s) < \infty$ and $\lim_{n\to\infty} m_{Z_n}(s) = m_X(s)$ for all $s \in (-s_0, s_0)$, then
$\{Z_n\}$ converges to $X$ in distribution.


We will make use of this result to prove one of the most famous theorems of probability 
— the central limit theorem. 

Finally, we note that combining Theorem 4.4.1 with Theorem 4.3.1 reveals the 
following. 


Corollary 4.4.1 If $X_n \to X$ with probability 1, then $X_n \xrightarrow{D} X$.


4.4.1 The Central Limit Theorem


We now present the central limit theorem, one of the most important results in all of 
probability theory. Intuitively, it says that a large sum of i.i.d. random variables, prop- 
erly normalized, will always have approximately a normal distribution. This shows 
that the normal distribution is extremely fundamental in probability and statistics — 
even though its density function is complicated and its cumulative distribution function 
is intractable. 

Suppose $X_1, X_2, \ldots$ is an i.i.d. sequence of random variables each having finite
mean $\mu$ and finite variance $\sigma^2$. Let $S_n = X_1 + \cdots + X_n$ be the sample sum and
$M_n = S_n/n$ be the sample mean. The central limit theorem is concerned with the
distribution of the random variable

$$Z_n = \frac{S_n - n\mu}{\sqrt{n}\,\sigma} = \frac{M_n - \mu}{\sigma/\sqrt{n}} = \sqrt{n}\left(\frac{M_n - \mu}{\sigma}\right),$$

where $\sigma = \sqrt{\sigma^2}$. We know $E(M_n) = \mu$ and $\mathrm{Var}(M_n) = \sigma^2/n$, which implies that
$E(Z_n) = 0$ and $\mathrm{Var}(Z_n) = 1$. The variable $Z_n$ is thus obtained from the sample mean
(or sample sum) by subtracting its mean and dividing by its standard deviation. This
transformation is referred to as standardizing a random variable, so that it has mean 0
and variance 1. Therefore, $Z_n$ is the standardized version of the sample mean (sample
sum).

Note that the distribution of $Z_n$ shares two characteristics with the $N(0, 1)$
distribution, namely, it has mean 0 and variance 1. The central limit theorem shows that there
is an even stronger relationship.




Theorem 4.4.3 (The central limit theorem) Let $X_1, X_2, \ldots$ be i.i.d. with finite mean
$\mu$ and finite variance $\sigma^2$. Let $Z \sim N(0, 1)$. Then as $n \to \infty$, the sequence $\{Z_n\}$
converges in distribution to $Z$, i.e., $Z_n \xrightarrow{D} Z$.


PROOF See Section 4.7 for the proof of this result. ■


The central limit theorem is so important that we shall restate its conclusions in 
several different ways. 


Corollary 4.4.2 For each fixed $x \in R^1$, $\lim_{n\to\infty} P(Z_n \le x) = \Phi(x)$, where $\Phi$ is
the cumulative distribution function for the standard normal distribution.


We can write this as follows.

Corollary 4.4.3 For each fixed $x \in R^1$,

$$\lim_{n\to\infty} P(S_n \le n\mu + x\sqrt{n}\,\sigma) = \Phi(x) \quad \text{and} \quad \lim_{n\to\infty} P(M_n \le \mu + x\sigma/\sqrt{n}) = \Phi(x).$$

In particular, $S_n$ is approximately equal to $n\mu$, with deviations from this value of
order $\sqrt{n}$, and $M_n$ is approximately equal to $\mu$, with deviations from this value of
order $1/\sqrt{n}$.


We note that it is not essential in the central limit theorem to divide by $\sigma$, in which
case the theorem asserts instead that $(S_n - n\mu)/\sqrt{n}$ (or $\sqrt{n}(M_n - \mu)$) converges in
distribution to the $N(0, \sigma^2)$ distribution. That is, the limiting distribution will still be
normal but will have variance $\sigma^2$ instead of variance 1.

Similarly, instead of dividing by exactly $\sigma$, it suffices to divide by any quantity $\sigma_n$,
provided $\sigma_n \xrightarrow{a.s.} \sigma$. A simple modification of the proof of Theorem 4.4.2 leads to the
following result.


Corollary 4.4.4 If

$$Z_n^* = \frac{S_n - n\mu}{\sqrt{n}\,\sigma_n} = \frac{M_n - \mu}{\sigma_n/\sqrt{n}} = \sqrt{n}\left(\frac{M_n - \mu}{\sigma_n}\right)$$

and $\sigma_n \xrightarrow{a.s.} \sigma$, then $Z_n^* \xrightarrow{D} Z$ as $n \to \infty$.


To illustrate the central limit theorem, we consider a simulation experiment. 


EXAMPLE 4.4.7 The Central Limit Theorem Illustrated in a Simulation 
Suppose we generate a sample $X_1, \ldots, X_n$ from the Uniform[0, 1] density. Note that
the Uniform[0, 1] density is completely unlike a normal density. An easy calculation
shows that when $X \sim$ Uniform[0, 1], then $E(X) = 1/2$ and $\mathrm{Var}(X) = 1/12$.

Now suppose we are interested in the distribution of the sample average $M_n =
S_n/n = (X_1 + \cdots + X_n)/n$ for various choices of $n$. The central limit theorem tells
us that

$$Z_n = \frac{M_n - 1/2}{\sqrt{1/(12n)}} = \sqrt{12n}\,(M_n - 1/2)$$

converges in distribution to an $N(0, 1)$ distribution. But how large does $n$ have to be
for this approximation to be accurate?

To assess this, we ran a Monte Carlo simulation experiment. In Figure 4.4.1, we
have plotted a density histogram of $N = 10^5$ values from the $N(0, 1)$ distribution based
on 800 subintervals of $(-4, 4)$, each of length $l = 0.01$. Density histograms are more
extensively discussed in Section 5.4.3, but for now we note that above each interval
we have plotted the proportion of sampled values that fell in the interval, divided by
the length of the interval. As we increase $N$ and decrease $l$, these histograms will look
more and more like the density of the distribution from which we are sampling. Indeed,
Figure 4.4.1 looks very much like an $N(0, 1)$ density, as it should.

In Figure 4.4.2, we have plotted a density histogram (using the same values of $N$
and $l$) of $Z_1$. Note that $Z_1 \sim$ Uniform$[-\sqrt{12}/2, \sqrt{12}/2]$, and indeed the histogram
does look like a uniform density. Figure 4.4.3 presents a density histogram of $Z_2$,
which still looks very nonnormal, but note that the histogram of $Z_3$ in Figure 4.4.4
is beginning to look more like a normal distribution. The histogram of $Z_{10}$ in
Figure 4.4.5 looks very normal. In fact, the proportion of $Z_{10}$ values in $(-\infty, 1.96]$, for
this histogram, equals 0.9759, while the exact proportion for an $N(0, 1)$ distribution is
0.9750.


[Figure 4.4.1: density histogram of the sampled $N(0, 1)$ values; vertical axis: Density.]

Figure 4.4.2: Density histogram for $10^5$ values of $Z_1$ in Example 4.4.7.

Figure 4.4.3: Density histogram for $10^5$ values of $Z_2$ in Example 4.4.7.

Figure 4.4.4: Density histogram for $10^5$ values of $Z_3$ in Example 4.4.7.

Figure 4.4.5: Density histogram for $10^5$ values of $Z_{10}$ in Example 4.4.7.


So in this example, the central limit theorem has taken effect very quickly, even
though we are sampling from a very nonnormal distribution. As it turns out, it is
primarily the tails of a distribution that determine how large $n$ has to be for the central
limit theorem approximation to be accurate. When a distribution has tails no heavier
than a normal distribution, we can expect the approximation to be quite accurate for
relatively small sample sizes. ■
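A simulation in the spirit of Example 4.4.7 can be sketched as follows. This is not the authors' code; the number of replications, the seed, and the function names are assumptions made for illustration, and the sketch reports only the proportion of simulated $Z_n$ values at or below 1.96 rather than drawing histograms.

```python
import math
import random

def standardized_mean(n, rng):
    """One draw of Z_n = sqrt(12 n) (M_n - 1/2) from n i.i.d. Uniform[0, 1] values."""
    m = sum(rng.random() for _ in range(n)) / n
    return math.sqrt(12 * n) * (m - 0.5)

def proportion_below(n, x, num_draws=20000, seed=0):
    """Fraction of simulated Z_n values that are <= x (num_draws and seed are arbitrary)."""
    rng = random.Random(seed)
    hits = sum(standardized_mean(n, rng) <= x for _ in range(num_draws))
    return hits / num_draws

for n in (1, 2, 3, 10):
    # For an N(0, 1) variable the corresponding probability is about 0.975.
    print(n, proportion_below(n, 1.96))
```

For $n = 1$ the proportion is exactly 1, since $Z_1$ never exceeds $\sqrt{12}/2 \approx 1.73$; as $n$ grows the proportion should settle near the normal value 0.975.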


We consider some further applications of the central limit theorem. 


EXAMPLE 4.4.8 

For example, suppose $X_1, X_2, \ldots$ are i.i.d. random variables, each with the Poisson(5)
distribution. Recall that this implies that $\mu = E(X_i) = 5$ and $\sigma^2 = \mathrm{Var}(X_i) = 5$.
Hence, for each fixed $x \in R^1$, we have

$$P\left(S_n \le 5n + x\sqrt{5n}\right) \to \Phi(x)$$

as $n \to \infty$. ■


EXAMPLE 4.4.9 Normal Approximation to the Binomial Distribution 

Suppose $X_1, X_2, \ldots$ are i.i.d. random variables, each with the Bernoulli$(\theta)$
distribution. Recall that this implies that $E(X_i) = \theta$ and $v = \mathrm{Var}(X_i) = \theta(1 - \theta)$. Hence, for
each fixed $x \in R^1$, we have

$$P\left(S_n \le n\theta + x\sqrt{n\theta(1 - \theta)}\right) \to \Phi(x), \qquad (4.4.1)$$

as $n \to \infty$.

But now note that we have previously shown that $Y_n = S_n \sim$ Binomial$(n, \theta)$. So
(4.4.1) implies that whenever we have a random variable $Y_n \sim$ Binomial$(n, \theta)$, then

$$P(Y_n \le y) = P\left(\frac{Y_n - n\theta}{\sqrt{n\theta(1 - \theta)}} \le \frac{y - n\theta}{\sqrt{n\theta(1 - \theta)}}\right) \approx \Phi\left(\frac{y - n\theta}{\sqrt{n\theta(1 - \theta)}}\right) \qquad (4.4.2)$$

for large $n$.

Note that we are approximating a discrete distribution by a continuous distribution
here. Reflecting this, a small improvement is often made to (4.4.2) when $y$ is a
nonnegative integer. Instead, we use

$$P(Y_n \le y) \approx \Phi\left(\frac{y + 0.5 - n\theta}{\sqrt{n\theta(1 - \theta)}}\right).$$

Adding 0.5 to $y$ is called the correction for continuity. In effect, this allocates all the
relevant normal probability in the interval $(y - 0.5, y + 0.5)$ to the nonnegative integer
$y$. This has been shown to improve the approximation (4.4.2). ■
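The effect of the correction for continuity can be seen in a small numerical comparison. The sketch below is illustrative only: the values $n = 50$, $\theta = 0.3$, $y = 12$ and the function names are assumptions, and $\Phi$ is computed from the error function rather than read from Table D.2.

```python
import math

def phi(z):
    """Standard normal cdf, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binom_cdf(n, theta, y):
    """Exact P(Y <= y) for Y ~ Binomial(n, theta)."""
    return sum(
        math.comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(y + 1)
    )

def normal_approx(n, theta, y, correction=True):
    """Normal approximation to P(Y <= y), optionally adding 0.5 to y."""
    shift = 0.5 if correction else 0.0
    return phi((y + shift - n * theta) / math.sqrt(n * theta * (1 - theta)))

n, theta, y = 50, 0.3, 12  # illustrative values, not from the text
exact = binom_cdf(n, theta, y)
print("exact            ", exact)
print("without 0.5 shift", normal_approx(n, theta, y, correction=False))
print("with 0.5 shift   ", normal_approx(n, theta, y, correction=True))
```

For these particular values the corrected approximation lands noticeably closer to the exact binomial probability than the uncorrected one.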


EXAMPLE 4.4.10 Approximating Probabilities Using the Central Limit Theorem 
While there are tables for the binomial distribution (Table D.6), we often have to
compute binomial probabilities for situations the tables do not cover. We can always use
statistical software for this; in fact, such software makes use of the normal
approximation we derived from the central limit theorem.

For example, suppose that we have a biased coin, where the probability of getting
a head on a single toss is $\theta = 0.6$. We will toss the coin $n = 1000$ times and then
calculate the probability of getting at least 550 heads and no more than 625 heads.
If $Y$ denotes the number of heads obtained in the 1000 tosses, we have that $Y \sim$
Binomial(1000, 0.6), so

$$E(Y) = 1000(0.6) = 600,$$
$$\mathrm{Var}(Y) = 1000(0.6)(0.4) = 240.$$

Therefore, using the correction for continuity and Table D.2,

$$P(550 \le Y \le 625) = P(550 - 0.5 \le Y \le 625 + 0.5)$$
$$= P\left(\frac{549.5 - 600}{\sqrt{240}} \le \frac{Y - 600}{\sqrt{240}} \le \frac{625.5 - 600}{\sqrt{240}}\right)$$
$$= P\left(-3.2598 \le \frac{Y - 600}{\sqrt{240}} \le 1.646\right)$$
$$\approx \Phi(1.65) - \Phi(-3.26) = 0.9505 - 0.0006 = 0.9499.$$

Note that it would be impossible to compute this probability using the formulas for the
binomial distribution. ■
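The arithmetic in Example 4.4.10 can be reproduced in a few lines. This is a sketch rather than the book's method: $\Phi$ is evaluated with the error function instead of Table D.2, so the result differs slightly from the table-rounded 0.9499, and the variable names are assumptions. Although the binomial sum is impractical by hand, exact integer binomial coefficients let a program evaluate it directly, which gives a check on the approximation.

```python
import math

def phi(z):
    """Standard normal cdf, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, theta = 1000, 0.6
mean = n * theta                            # 600
sd = math.sqrt(n * theta * (1 - theta))     # sqrt(240)

# Standardized endpoints after the correction for continuity.
z_low = (550 - 0.5 - mean) / sd
z_high = (625 + 0.5 - mean) / sd
approx = phi(z_high) - phi(z_low)

# Direct evaluation of the binomial sum, for comparison only.
exact = sum(
    math.comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(550, 626)
)

print("z_low, z_high:", z_low, z_high)
print("normal approximation:", approx)
print("direct binomial sum: ", exact)
```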


One of the most important uses of the central limit theorem is that it leads to a 
method for assessing the error in an average when this is estimating or approximating 
some quantity of interest. 


4.4.2 The Central Limit Theorem and Assessing Error


Suppose $X_1, X_2, \ldots$ is an i.i.d. sequence of random variables, each with finite mean
$\mu$ and finite variance $\sigma^2$, and we are using the sample average $M_n$ to approximate the
mean $\mu$. This situation arises commonly in many computational (see Section 4.5) and
statistical (see Chapter 6) problems. In such a context, we can generate the $X_i$, but we
do not know the value of $\mu$.

If we approximate $\mu$ by $M_n$, then a natural question to ask is: How much error is
there in the approximation? The central limit theorem tells us that

$$\lim_{n\to\infty} P\left(-3 < \frac{M_n - \mu}{\sigma/\sqrt{n}} < 3\right) = \lim_{n\to\infty} P\left(M_n - 3\frac{\sigma}{\sqrt{n}} < \mu < M_n + 3\frac{\sigma}{\sqrt{n}}\right) = \Phi(3) - \Phi(-3).$$

Using Table D.2 (or statistical software), we have that $\Phi(3) - \Phi(-3) = 0.9987 - (1 -
0.9987) = 0.9974$. So, for large $n$, we have that the interval

$$\left(M_n - 3\sigma/\sqrt{n},\; M_n + 3\sigma/\sqrt{n}\right)$$

contains the unknown value of $\mu$ with virtual certainty (actually with probability about
0.9974). Therefore, the half-length $3\sigma/\sqrt{n}$ of this interval gives us an assessment of
the error in the approximation $M_n$. Note that $\mathrm{Var}(M_n) = \sigma^2/n$, so the half-length of
the interval equals 3 standard deviations of the estimate $M_n$.




Because we do not know $\mu$, it is extremely unlikely that we will know $\sigma$ (as its
definition uses $\mu$). But if we can find a consistent estimate $\sigma_n$ of $\sigma$, then we can use
Corollary 4.4.4 instead to construct such an interval.

As it turns out, the correct choice of $\sigma_n$ depends on what we know about the
distribution we are sampling from (see Chapter 6 for more discussion of this). For example,
if $X_1 \sim$ Bernoulli$(\theta)$, then $\mu = \theta$ and $\sigma^2 = \mathrm{Var}(X_1) = \theta(1 - \theta)$. By the strong law
of large numbers (Theorem 4.3.2), $M_n \xrightarrow{a.s.} \mu = \theta$ and thus

$$\sigma_n = \sqrt{M_n(1 - M_n)} \xrightarrow{a.s.} \sqrt{\theta(1 - \theta)} = \sigma.$$

Then, using the same argument as above, we have that, for large $n$, the interval

$$\left(M_n - 3\sqrt{M_n(1 - M_n)/n},\; M_n + 3\sqrt{M_n(1 - M_n)/n}\right) \qquad (4.4.3)$$

contains the true value of $\theta$ with virtual certainty (again, with probability about 0.9974).
The half-length of (4.4.3) is a measure of the accuracy of the estimate $M_n$; notice
that this can be computed from the values $X_1, \ldots, X_n$. We refer to the quantity
$(M_n(1 - M_n)/n)^{1/2}$ as the standard error of the estimate $M_n$.

For a general random variable $X_1$, let

$$\sigma_n^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - M_n)^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - 2M_n\sum_{i=1}^{n} X_i + nM_n^2\right)$$
$$= \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2 - 2M_n^2 + M_n^2\right) = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2 - M_n^2\right).$$

By the strong law of large numbers, we have that $M_n \xrightarrow{a.s.} \mu$ and

$$\frac{1}{n}\sum_{i=1}^{n} X_i^2 \xrightarrow{a.s.} E(X_1^2) = \sigma^2 + \mu^2.$$

Because $n/(n - 1) \to 1$ and $M_n^2 \xrightarrow{a.s.} \mu^2$ as well, we conclude that $\sigma_n^2 \xrightarrow{a.s.} \sigma^2$. This
implies that $\sigma_n \xrightarrow{a.s.} \sigma$; hence $\sigma_n$ is consistent for $\sigma$. It is common to call $\sigma_n^2$ the sample
variance of the sample $X_1, \ldots, X_n$. When the sample size $n$ is fixed, we will often
denote this estimate of the variance by $S^2$.

Again, using the above argument, we have that, for large $n$, the interval

$$\left(M_n - 3\sigma_n/\sqrt{n},\; M_n + 3\sigma_n/\sqrt{n}\right) = \left(M_n - 3S/\sqrt{n},\; M_n + 3S/\sqrt{n}\right) \qquad (4.4.4)$$

contains the true value of $\mu$ with virtual certainty (also with probability about 0.9974).
Therefore, the half-length is a measure of the accuracy of the estimate $M_n$; notice
that this can be computed from the values $X_1, \ldots, X_n$. The quantity $S/\sqrt{n}$ is referred
to as the standard error of the estimate $M_n$.
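The interval (4.4.4) is straightforward to compute from data. The following minimal sketch assumes the observations are held in a Python list; the function name and the small data set in the usage lines are hypothetical.

```python
import math

def three_se_interval(xs):
    """Return (M_n, standard error, interval (4.4.4)) for the sample xs.

    Uses the sample variance with divisor n - 1, so len(xs) must be at least 2.
    """
    n = len(xs)
    m = sum(xs) / n                               # sample mean M_n
    s2 = sum((x - m) ** 2 for x in xs) / (n - 1)  # sample variance S^2
    se = math.sqrt(s2 / n)                        # standard error S / sqrt(n)
    return m, se, (m - 3 * se, m + 3 * se)

# Hypothetical data, used only to show the call.
m, se, interval = three_se_interval([2.0, 4.0, 4.0, 6.0])
print(m, se, interval)
```

With so few observations the central limit theorem justification does not really apply; the usage lines only illustrate the arithmetic.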

We will make use of these estimates of the error in approximations in the following 
section. 




Summary of Section 4.4 


• A sequence $\{X_n\}$ of random variables converges in distribution to $Y$ if, for all $y \in
R^1$ with $P(Y = y) = 0$, we have $\lim_{n\to\infty} F_{X_n}(y) = F_Y(y)$, i.e.,
$\lim_{n\to\infty} P(X_n \le y) = P(Y \le y)$.

• If $\{X_n\}$ converges to $Y$ in probability (or with probability 1), then $\{X_n\}$ converges
to $Y$ in distribution.

• The very important central limit theorem says that if $\{X_n\}$ are i.i.d. with finite
mean $\mu$ and variance $\sigma^2$, then the random variables $Z_n = (S_n - n\mu)/(\sqrt{n}\,\sigma)$
converge in distribution to a standard normal distribution.

• The central limit theorem allows us to approximate various distributions by
normal distributions, which is helpful in simulation experiments and in many other
contexts. Table D.2 (or any statistical software package) provides values for the
cumulative distribution function of a standard normal.


EXERCISES 


4.4.1 Suppose $P(X_n = i) = (n + i)/(3n + 6)$ for $i = 1, 2, 3$. Suppose also that
$P(X = i) = 1/3$ for $i = 1, 2, 3$. Prove that $\{X_n\}$ converges in distribution to $X$.

4.4.2 Suppose $P(Y_n = k) = (1 - 2^{-n-1})^{-1}/2^{k+1}$ for $k = 0, 1, \ldots, n$. Let $Y \sim$
Geometric(1/2). Prove that $\{Y_n\}$ converges in distribution to $Y$.

4.4.3 Let $Z_n$ have density $(n + 1)x^n$ for $0 < x < 1$, and 0 otherwise. Let $Z = 1$.
Prove that $\{Z_n\}$ converges in distribution to $Z$.

4.4.4 Let $W_n$ have density

$$\frac{1 + x/n}{1 + 1/(2n)}$$

for $0 < x < 1$, and 0 otherwise. Let $W \sim$ Uniform[0, 1]. Prove that $\{W_n\}$ converges
in distribution to $W$.

4.4.5 Let Y1, Y2,... be iid. with distribution Exponential(3). Use the central limit 
theorem and Table D.2 (or software) to estimate the probability Poe Y; < 540). 
4.4.6 Let Zi, Z2,... be i.i.d. with distribution Uniform[—20, 10]. Use the central 
limit theorem and Table D.2 (or software) to estimate the probability POE Zi > 
—4470). 

4.4.7 Let X1, X2,... bei.id. with distribution Geometric(1 /4). Use the central limit 
theorem and Table D.2 (or software) to estimate the probability PES Xi > 2450). 
4.4.8 Suppose $X_n \sim N(0, 1/n)$, i.e., $X_n$ has a normal distribution with mean 0 and
variance $1/n$. Does the sequence $\{X_n\}$ converge in distribution to some random
variable? If yes, what is the distribution of the random variable?

4.4.9 Suppose $P(X_n = i/n) = 2i/(n(n + 1))$ for $i = 1, 2, 3, \ldots, n$. Let $Z$ have density
function given by $f(z) = 2z$ for $0 < z < 1$, otherwise $f(z) = 0$.

(a) Compute $P(Z \le y)$ for $0 < y < 1$.

(b) Compute $P(X_n \le m/n)$ for some integer $1 \le m \le n$. (Hint: Remember that
$\sum_{i=1}^{m} i = m(m + 1)/2$.)

(c) Compute $P(X_n \le y)$ for $0 < y < 1$.

(d) Prove that $X_n \xrightarrow{D} Z$.

4.4.10 Suppose $P(Y_n \le y) = 1 - e^{-2ny/(n+1)}$ for all $y > 0$. Prove that $Y_n \xrightarrow{D} Y$ where
$Y \sim$ Exponential$(\lambda)$ for some $\lambda > 0$ and compute $\lambda$.

4.4.11 Suppose $P(Z_n \le z) = 1 - (1 - 3z/n)^n$ for all $0 < z < n/3$. Prove that $Z_n \xrightarrow{D} Z$
where $Z \sim$ Exponential$(\lambda)$ for some $\lambda > 0$ and compute $\lambda$. (Hint: Recall from calculus
that $\lim_{n\to\infty}(1 + c/n)^n = e^c$ for any real number $c$.)

4.4.12 Suppose the service time, in minutes, at a bank has the Exponential distribution
with $\lambda = 1/2$. Use the central limit theorem to estimate the probability that the average
service time of the first $n$ customers is less than 2.5 minutes, when:

(a) $n = 16$.
(b) $n = 36$.
(c) $n = 100$.

4.4.13 Suppose the number of kilograms of a metal alloy produced by a factory each
week is uniformly distributed between 20 and 30. Use the central limit theorem to
estimate the probability that next year's output will be less than 1280 kilograms. (Assume
that a year contains precisely 52 weeks.)

4.4.14 Suppose the time, in days, until a component fails has the Gamma distribution
with $\alpha = 5$ and $\lambda = 1/10$. When a component fails, it is immediately replaced by
a new component. Use the central limit theorem to estimate the probability that 40
components will together be sufficient to last at least 6 years. (Assume that a year
contains precisely 365.25 days.)


COMPUTER EXERCISES 


4.4.15 Generate $N$ samples $X_1, X_2, \ldots, X_{20} \sim$ Exponential(3) for $N$ large ($N = 10^4$,
if possible). Use these samples to estimate the probability $P(1/6 \le M_{20} \le 1/2)$. How
does your answer compare to what the central limit theorem gives as an approximation?

4.4.16 Generate $N$ samples $X_1, X_2, \ldots, X_{30} \sim$ Uniform[−20, 10] for $N$ large ($N =
10^4$, if possible). Use these samples to estimate the probability $P(M_{30} \le -5)$. How
does your answer compare to what the central limit theorem gives as an approximation?

4.4.17 Generate $N$ samples $X_1, X_2, \ldots, X_{20} \sim$ Geometric(1/4) for $N$ large ($N =
10^4$, if possible). Use these samples to estimate the probability $P(2.5 \le M_{20} \le 3.3)$.
How does your answer compare to what the central limit theorem gives as an
approximation?

4.4.18 Generate $N$ samples $X_1, X_2, \ldots, X_{20}$ from the distribution of $\log Z$ where $Z \sim$
Gamma(4, 1) for $N$ large ($N = 10^4$, if possible). Use these samples to construct a
density histogram of the values of $M_{20}$. Comment on the shape of this graph.

4.4.19 Generate $N$ samples $X_1, X_2, \ldots, X_{20}$ from the Binomial(10, 0.01) distribution
for $N$ large ($N = 10^4$, if possible). Use these samples to construct a density histogram
of the values of $M_{20}$. Comment on the shape of this graph.




PROBLEMS 


4.4.20 Let $a_1, a_2, \ldots$ be any sequence of nonnegative real numbers with $\sum_i a_i = 1$.
Suppose $P(X = i) = a_i$ for every positive integer $i$. Construct a sequence $\{X_n\}$ of
absolutely continuous random variables, such that $X_n \to X$ in distribution.

4.4.21 Let $f : [0, 1] \to (0, \infty)$ be a continuous positive function such that $\int_0^1 f(x)\,dx
= 1$. Consider random variables $X$ and $\{X_n\}$ such that $P(a \le X \le b) = \int_a^b f(x)\,dx$
for $a \le b$ and

$$P(X_n = i/n) = \frac{f(i/n)}{\sum_{j=1}^{n} f(j/n)}$$

for $i = 1, 2, 3, \ldots, n$. Prove that $X_n \to X$ in distribution.
4.4.22 Suppose that Y; = X and that X1, ..., X, is a sample from an N (u, o°) 
distribution. Indicate how you would approximate the probability P (M, < m) where 
Mn = (Yi +- + Yn) /n. 
4.4.23 Suppose $Y_i = \cos(2\pi U_i)$ and $U_1, \ldots, U_n$ is a sample from the Uniform[0, 1]
distribution. Indicate how you would approximate the probability $P(M_n \le m)$,
where $M_n = (Y_1 + \cdots + Y_n)/n$.


COMPUTER PROBLEMS 


4.4.24 Suppose that Y = X? and X ~ N(0, 1). By generating a large sample (n = 
104, if possible) from the distribution of Y, approximate the probability P(Y < 1) and 
assess the error in your approximation. Compute this probability exactly and compare 
it with your approximation. 

4.4.25 Suppose that Y = X? and X ~ N(0, 1). By generating a large sample (n = 
104, if possible) from the distribution of Y, approximate the expectation E (cos(X¥°)), 
and assess the error in your approximation. 


CHALLENGES 


4.4.26 Suppose $X_n \to C$ in distribution, where $C$ is a constant. Prove that $X_n \to C$
in probability. (This proves that if $X$ is constant, then the converse to Theorem 4.4.1
does hold, even though it does not hold for general $X$.)


4.5 Monte Carlo Approximations


The laws of large numbers say that if $X_1, X_2, \ldots$ is an i.i.d. sequence of random
variables with mean $\mu$, and

$$M_n = \frac{X_1 + \cdots + X_n}{n},$$

then for large $n$ we will have $M_n \approx \mu$.

Suppose now that $\mu$ is unknown. Then, as discussed in Section 4.4.2, it is possible
to change perspective and use $M_n$ (for large $n$) as an estimator or approximation of $\mu$.
Any time we approximate or estimate a quantity, we must also say something about
how much error is in the estimate. Of course, we cannot say what this error is exactly,
as that would require knowing the exact value of $\mu$. In Section 4.4.2, however, we
showed how the central limit theorem leads to a very natural approach to assessing this
error, using three times the standard error of the estimate. We consider some examples.


EXAMPLE 4.5.1 

Consider flipping a sequence of identical coins, each of which has probability $\theta$ of
coming up heads, but where $\theta$ is unknown. Let $M_n$ again be the fraction of the first $n$
coins that are heads. Then we know that for large $n$, it is very likely that $M_n$ is very
close to $\theta$. Hence, we can use $M_n$ to estimate $\theta$. Furthermore, the discussion in Section
4.4.2 indicates that (4.4.3) is the relevant interval to quote when assessing the accuracy
of the estimate $M_n$. ■


EXAMPLE 4.5.2 
Suppose we believe a certain medicine lowers blood pressure, but we do not know by
how much. We would like to know the mean amount $\mu$ by which this medicine lowers
blood pressure.

Suppose we observe $n$ patients (chosen at random so they are i.i.d.), where patient
$i$ has blood pressure $B_i$ before taking the medicine and blood pressure $A_i$ afterwards.
Let $X_i = B_i - A_i$. Then

$$M_n = \frac{1}{n}\sum_{i=1}^{n} (B_i - A_i)$$

is the average amount of blood pressure decrease. (Note that $B_i - A_i$ may be negative
for some patients, and it is important to also include those negative terms in the sum.)
Then for large $n$, the value of $M_n$ is a good estimate of $E(X_i) = \mu$. Furthermore, the
discussion in Section 4.4.2 indicates that (4.4.4) is the relevant interval to quote when
assessing the accuracy of the estimate $M_n$. ■


Such estimators can also be used to estimate purely mathematical quantities that do 
not involve any experimental data (such as coins or medical patients) but that are too 
difficult to compute directly. In this case, such estimators are called Monte Carlo ap- 
proximations (named after the gambling casino in the principality of Monaco because 
they introduce randomness to solve nonrandom problems). 


EXAMPLE 4.5.3 
Suppose we wish to evaluate

$$I = \int_0^1 \cos(x^2)\sin(x^4)\,dx.$$

This integral cannot easily be solved exactly. But it can be approximately computed
using a Monte Carlo approximation, as follows. We note that

$$I = E(\cos(U^2)\sin(U^4)),$$

where $U \sim$ Uniform[0, 1]. Hence, for large $n$, the integral $I$ is approximately equal
to $M_n = (T_1 + \cdots + T_n)/n$, where $T_i = \cos(U_i^2)\sin(U_i^4)$, and where $U_1, U_2, \ldots$ are
i.i.d. Uniform[0, 1].

Putting this all together, we obtain an algorithm for approximating the integral $I$,
as follows.

1. Select a large positive integer $n$.
2. Obtain $U_i \sim$ Uniform[0, 1], independently for $i = 1, 2, \ldots, n$.
3. Set $T_i = \cos(U_i^2)\sin(U_i^4)$, for $i = 1, 2, \ldots, n$.
4. Estimate $I$ by $M_n = (T_1 + \cdots + T_n)/n$.

For large enough $n$, this algorithm will provide a good estimate of the integral $I$.
For example, the following table records the estimates $M_n$ and the intervals (4.4.4)
based on samples of Uniform[0, 1] variables for various choices of $n$.

    M_n         M_n − 3S/√n    M_n + 3S/√n
    0.145294    0.130071       0.160518
    0.138850    0.134105       0.143595
    0.139484    0.137974       0.140993

From this we can see that the value of $I$ is approximately 0.139484, and the true value
is almost certainly in the interval (0.137974, 0.140993). Notice how the lengths of
the intervals decrease as we increase $n$. In fact, it can be shown that the exact value is
$I = 0.139567$, so our approximation is excellent. ■
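The four-step algorithm of Example 4.5.3 translates directly into code. The following is a minimal sketch, assuming Python's standard `random` generator; the sample sizes and the seed are illustrative, so the printed numbers will not match the table in the example.

```python
import math
import random

def estimate_integral(n, seed=1):
    """Monte Carlo estimate of I = integral of cos(x^2) sin(x^4) over [0, 1].

    Returns (M_n, M_n - 3 S / sqrt(n), M_n + 3 S / sqrt(n)), i.e., the
    estimate together with the interval (4.4.4). The seed is arbitrary.
    """
    rng = random.Random(seed)
    ts = []
    for _ in range(n):
        u = rng.random()                              # U_i ~ Uniform[0, 1]
        ts.append(math.cos(u**2) * math.sin(u**4))    # T_i
    m = sum(ts) / n
    s = math.sqrt(sum((t - m) ** 2 for t in ts) / (n - 1))
    half = 3 * s / math.sqrt(n)
    return m, m - half, m + half

for n in (10**3, 10**4, 10**5):
    print(n, *estimate_integral(n))
```

Each tenfold increase in $n$ shrinks the interval by a factor of about $\sqrt{10}$, which matches the pattern of interval lengths in the table.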


EXAMPLE 4.5.4 
Suppose we want to evaluate the integral 


OO 
D) 25x? cos(x*)e~2* dx. 
0 


This integral cannot easily be solved exactly, but it can also be approximately computed 
using a Monte Carlo approximation, as follows. 

We note first that Z = E(X? cos(Xĉ?)), where X ~ Exponential(25). Hence, for 
large n, the integral J is approximately equal to M, = (71 +--- + Ta)/n, where 
T; = X? cos(X?), with X1, X2, ... iid. Exponential(25). 

Now, we know from Section 2.10 that we can simulate X ~ Exponential(25) by 
setting X = —In(U)/25 where U ~ Uniform[0, 1]. Hence, putting this all together, 
we obtain an algorithm for approximating the integral Z, as follows. 


l. Select a large positive integer n. 
2. Obtain U; ~ Uniform[0,1], independently for i=1,2,...,n. 
3. Set X; =—In(U;)/25, for i=1,2,...,n. 


4. Set T; = X? cos(X?), for i=1,2,...,n. 


5. Estimate J by Mn, = (Ti +--+ Ta)/n. 


Chapter 4: Sampling Distributions and Limits 227 


For large enough $n$, this algorithm will provide a good estimate of the integral $I$.
For example, the following table records the estimates $M_n$ and the intervals (4.4.4)
based on samples of Exponential(25) variables for various choices of $n$.

$M_n$                        $M_n - 3S/\sqrt{n}$          $M_n + 3S/\sqrt{n}$
$3.33846 \times 10^{-3}$     $2.63370 \times 10^{-3}$     $4.04321 \times 10^{-3}$
$3.29933 \times 10^{-3}$     $3.06646 \times 10^{-3}$     $3.53220 \times 10^{-3}$
$3.20629 \times 10^{-3}$     $3.13759 \times 10^{-3}$     $3.27499 \times 10^{-3}$

From this we can see that the value of $I$ is approximately $3.20629 \times 10^{-3}$ and that the
true value is almost certainly in the interval $(3.13759 \times 10^{-3}, 3.27499 \times 10^{-3})$. ■
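
As a hedged illustration of the five steps above, the following Python sketch generates Exponential(25) values by the inversion $X = -\ln(U)/25$ and averages $X^2\cos(X^2)$. The function name `mc_exponential_integral`, the seed, and the sample size are our own assumptions; we also draw $U$ from $(0, 1]$ rather than $[0, 1)$ purely to avoid taking the logarithm of 0.

```python
import math
import random


def mc_exponential_integral(n, rate=25.0, seed=1):
    """Estimate E(X^2 cos(X^2)) for X ~ Exponential(rate) by Monte Carlo.

    Returns (M_n, lower, upper) with the 3-standard-error interval (4.4.4).
    All names here are hypothetical choices for this sketch.
    """
    rng = random.Random(seed)
    ts = []
    for _ in range(n):
        u = 1.0 - rng.random()           # U in (0, 1], so log(u) is finite
        x = -math.log(u) / rate          # inversion: X ~ Exponential(rate)
        ts.append(x ** 2 * math.cos(x ** 2))
    m = sum(ts) / n
    s = math.sqrt(sum((t - m) ** 2 for t in ts) / (n - 1))
    half = 3.0 * s / math.sqrt(n)
    return m, m - half, m + half


print(mc_exponential_integral(10 ** 5))
```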


EXAMPLE 4.5.5 
Suppose we want to evaluate the sum 


$$S = \sum_{j=0}^{\infty} (j^2 + 3)^{-7} \, 5^{-j}.$$


Though this is very difficult to compute directly, it can be approximately computed 
using a Monte Carlo approximation. 
Let us rewrite the sum as 


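$$S = \frac{5}{4} \sum_{j=0}^{\infty} (j^2 + 3)^{-7} \, \frac{4}{5} \left(1 - \frac{4}{5}\right)^{j}.$$

We then see that $S = (5/4)\, E((X^2 + 3)^{-7})$, where $X \sim$ Geometric(4/5).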

Now, we know from Section 2.10 that we can simulate $X \sim$ Geometric(4/5) by
setting $X = \lfloor \ln(1 - U)/\ln(1 - 4/5) \rfloor$ or, equivalently, $X = \lfloor \ln(U)/\ln(1 - 4/5) \rfloor$,
where $U \sim$ Uniform[0, 1] and where $\lfloor \cdot \rfloor$ means to round down to the next integer
value. Hence, we obtain an algorithm for approximating the sum $S$, as follows.


1. Select a large positive integer $n$.
2. Obtain $U_i \sim$ Uniform[0,1], independently for $i = 1, 2, \ldots, n$.
3. Set $X_i = \lfloor \ln(U_i)/\ln(1 - 4/5) \rfloor$, for $i = 1, 2, \ldots, n$.
4. Set $T_i = (X_i^2 + 3)^{-7}$, for $i = 1, 2, \ldots, n$.
5. Estimate $S$ by $M_n = (5/4)(T_1 + \cdots + T_n)/n$.


For large enough $n$, this algorithm will provide a good estimate of the sum $S$.
For example, the following table records the estimates $M_n$ and the intervals (4.4.4)
based on samples of Geometric(4/5) variables for various choices of $n$.

$M_n$                        $M_n - 3S/\sqrt{n}$          $M_n + 3S/\sqrt{n}$
$4.66773 \times 10^{-4}$     $4.47078 \times 10^{-4}$     $4.86468 \times 10^{-4}$
$4.73538 \times 10^{-4}$     $4.67490 \times 10^{-4}$     $4.79586 \times 10^{-4}$
$4.69377 \times 10^{-4}$     $4.67436 \times 10^{-4}$     $4.71318 \times 10^{-4}$




From this we can see that the value of $S$ is approximately $4.69377 \times 10^{-4}$ and that the
true value is almost certainly in the interval $(4.67436 \times 10^{-4}, 4.71318 \times 10^{-4})$. ■
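
The same recipe can be sketched in Python for the sum, reading the summand as $(j^2+3)^{-7}\,5^{-j}$. This is a minimal sketch under our own assumptions: the function name `mc_geometric_sum`, the seed, and the sample size are hypothetical, and $U$ is drawn from $(0, 1]$ so that the logarithm is always defined.

```python
import math
import random


def mc_geometric_sum(n, seed=2):
    """Estimate S = (5/4) E((X^2 + 3)^(-7)) with X ~ Geometric(4/5).

    Returns (M_n, lower, upper); the interval is (5/4) times the interval
    (4.4.4) for the mean of the T_i. Names are hypothetical.
    """
    rng = random.Random(seed)
    log_q = math.log(1.0 - 4.0 / 5.0)            # ln(1 - 4/5), a negative number
    ts = []
    for _ in range(n):
        u = 1.0 - rng.random()                   # U in (0, 1]
        x = math.floor(math.log(u) / log_q)      # X takes values 0, 1, 2, ...
        ts.append((x * x + 3) ** -7)
    m = sum(ts) / n
    s = math.sqrt(sum((t - m) ** 2 for t in ts) / (n - 1))
    half = 3.0 * s / math.sqrt(n)
    return 1.25 * m, 1.25 * (m - half), 1.25 * (m + half)


print(mc_geometric_sum(10 ** 5))
```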


Note that when using a Monte Carlo approximation, it is not necessary that the 
range of an integral or sum be the entire range of the corresponding random variable, 
as follows. 


EXAMPLE 4.5.6 
Suppose we want to evaluate the integral 


$$I = \int_0^\infty \sin(x) e^{-x^2/2} \, dx.$$


Again, this is extremely difficult to evaluate exactly. 
Here 


$$I = \sqrt{2\pi} \, E(\sin(X) I_{\{X > 0\}}),$$

where $X \sim N(0, 1)$ and $I_{\{X > 0\}}$ is the indicator function of the event $\{X > 0\}$. We
know from Section 2.10 that we can simulate $X \sim N(0, 1)$ by setting

$$X = \sqrt{2 \log(1/U)} \, \cos(2\pi V),$$

where $U$ and $V$ are i.i.d. Uniform[0, 1]. Hence, we obtain the following algorithm for
approximating the integral $I$.


1. Select a large positive integer $n$.
2. Obtain $U_i, V_i \sim$ Uniform[0, 1], independently for $i = 1, 2, \ldots, n$.
3. Set $X_i = \sqrt{2 \log(1/U_i)} \, \cos(2\pi V_i)$, for $i = 1, 2, \ldots, n$.
4. Set $T_i = \sin(X_i) I_{\{X_i > 0\}}$, for $i = 1, 2, \ldots, n$. (That is, set $T_i = \sin(X_i)$ if $X_i > 0$, otherwise set $T_i = 0$.)
5. Estimate $I$ by $M_n = \sqrt{2\pi} \, (T_1 + \cdots + T_n)/n$.


For large enough $n$, this algorithm will again provide a good estimate of the integral $I$.
For example, the following table records the estimates $M_n$ and the intervals (4.4.4)
based on samples of $N(0, 1)$ variables for various choices of $n$.

$M_n$        $M_n - 3S/\sqrt{n}$    $M_n + 3S/\sqrt{n}$
0.744037     0.657294               0.830779
0.733945     0.706658               0.761233
0.722753     0.714108               0.731398

From this we can see that the value of $I$ is approximately 0.722753 and that the true
value is almost certainly in the interval (0.714108, 0.731398). ■
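
A hedged Python sketch of this algorithm follows, taking the integrand to be $\sin(x)e^{-x^2/2}$ on $(0, \infty)$ and generating normals by the transformation $X = \sqrt{2\log(1/U)}\cos(2\pi V)$. The function name `mc_normal_integral`, the seed, and the sample size are our own assumptions, and $U$ is drawn from $(0, 1]$ to keep $1/U$ finite.

```python
import math
import random


def mc_normal_integral(n, seed=3):
    """Estimate sqrt(2 pi) E(sin(X) 1{X > 0}) for X ~ N(0, 1).

    Returns (M_n, lower, upper), where the interval is sqrt(2 pi) times the
    interval (4.4.4) for the mean of the T_i. Names are hypothetical.
    """
    rng = random.Random(seed)
    ts = []
    for _ in range(n):
        u = 1.0 - rng.random()                   # U in (0, 1]
        v = rng.random()
        x = math.sqrt(2.0 * math.log(1.0 / u)) * math.cos(2.0 * math.pi * v)
        ts.append(math.sin(x) if x > 0 else 0.0)  # T_i = sin(X_i) 1{X_i > 0}
    m = sum(ts) / n
    s = math.sqrt(sum((t - m) ** 2 for t in ts) / (n - 1))
    half = 3.0 * s / math.sqrt(n)
    scale = math.sqrt(2.0 * math.pi)
    return scale * m, scale * (m - half), scale * (m + half)


print(mc_normal_integral(10 ** 5))
```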


Now we consider an important problem for statistical applications of probability 
theory. 




EXAMPLE 4.5.7 Approximating Sampling Distributions Using Monte Carlo 
Suppose $X_1, X_2, \ldots, X_n$ is an i.i.d. sequence from the probability measure $P$. We want
to find the distribution of a new random variable $Y = h(X_1, X_2, \ldots, X_n)$ for some
function $h$. Provided we can generate from $P$, Monte Carlo methods give us a
way to approximate this distribution.

Denoting the cumulative distribution function of $Y$ by $F_Y$, we have

$$F_Y(y) = P_Y((-\infty, y]) = E_{P_Y}(I_{(-\infty, y]}(Y)) = E(I_{(-\infty, y]}(h(X_1, X_2, \ldots, X_n))).$$

So $F_Y(y)$ can be expressed as the expectation of the random variable

$$I_{(-\infty, y]}(h(X_1, X_2, \ldots, X_n))$$

based on sampling from $P$.

To estimate this, we generate $N$ samples of size $n$

$$(X_{i1}, X_{i2}, \ldots, X_{in}),$$

for $i = 1, \ldots, N$ from $P$ (note $N$ is the Monte Carlo sample size and can be varied,
whereas the sample size $n$ is fixed here) and then calculate the proportion of values
$h(X_{i1}, X_{i2}, \ldots, X_{in}) \in (-\infty, y]$. The estimate $M_N$ is then given by

$$\hat{F}_Y(y) = \frac{1}{N} \sum_{i=1}^{N} I_{(-\infty, y]}(h(X_{i1}, X_{i2}, \ldots, X_{in})).$$


By the laws of large numbers, this converges to $F_Y(y)$ as $N \to \infty$. To evaluate the
error in this approximation, we use (4.4.3), which now takes the form

$$\left( \hat{F}_Y(y) - 3\sqrt{\hat{F}_Y(y)(1 - \hat{F}_Y(y))/N}, \; \hat{F}_Y(y) + 3\sqrt{\hat{F}_Y(y)(1 - \hat{F}_Y(y))/N} \right).$$

We presented an application of this in Example 4.4.7. Note that if the base of a rectangle
in the histogram of Figure 4.4.2 is given by $(a, b]$, then the height of this rectangle
equals the proportion of values that fell in $(a, b]$ times $1/(b - a)$. This can be expressed
as $(\hat{F}_Y(b) - \hat{F}_Y(a))/(b - a)$, which converges to $(F_Y(b) - F_Y(a))/(b - a)$
as $N \to \infty$. This proves that the areas of the rectangles in the histogram converge to
$F_Y(b) - F_Y(a)$ as $N \to \infty$.

More generally, we can approximate an expectation $E(g(Y))$ using the average

$$\frac{1}{N} \sum_{i=1}^{N} g(h(X_{i1}, X_{i2}, \ldots, X_{in})).$$

By the laws of large numbers, this average converges to $E(g(Y))$ as $N \to \infty$. ■
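
To make the procedure concrete, here is a hedged Python sketch of the cdf estimate together with its 3-standard-error interval. Everything specific in it is our own assumption: the function name `mc_cdf`, and the illustrative choice of $P$ = Uniform[0, 1], $h$ = maximum, $n = 5$, and $y = 0.8$ (picked because the exact answer $0.8^5$ is easy to check).

```python
import math
import random


def mc_cdf(h, sampler, n, N, y, seed=4):
    """Estimate F_Y(y) for Y = h(X_1, ..., X_n) from N Monte Carlo samples.

    `sampler(rng)` returns one draw from P. Returns (F_hat, lower, upper),
    where the interval is F_hat +/- 3 sqrt(F_hat (1 - F_hat) / N).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(N):
        sample = [sampler(rng) for _ in range(n)]   # one sample of size n
        if h(sample) <= y:                          # indicator of (-inf, y]
            hits += 1
    f_hat = hits / N
    half = 3.0 * math.sqrt(f_hat * (1.0 - f_hat) / N)
    return f_hat, f_hat - half, f_hat + half


# Hypothetical illustration: Y is the maximum of 5 Uniform[0, 1] variables.
print(mc_cdf(max, lambda rng: rng.random(), n=5, N=20_000, y=0.8))
print("exact value for comparison:", 0.8 ** 5)
```

Note that the Monte Carlo sample size `N` controls the accuracy, while `n` is part of the definition of $Y$ and stays fixed.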


Typically, there is more than one possible Monte Carlo algorithm for estimating
a quantity of interest. For example, suppose we want to approximate the integral
$\int_a^b g(x)\,dx$, where we assume this integral is finite. Let $f$ be a density on the interval
$(a, b)$, such that $f(x) > 0$ for every $x \in (a, b)$, and suppose we have a convenient
algorithm for generating $X_1, X_2, \ldots$ i.i.d. with distribution given by $f$. We have that

$$\int_a^b g(x)\,dx = \int_a^b \frac{g(x)}{f(x)} f(x)\,dx = E\left( \frac{g(X)}{f(X)} \right)$$

when $X$ is distributed with density $f$. So we can estimate $\int_a^b g(x)\,dx$ by

$$M_n = \frac{1}{n} \sum_{i=1}^{n} \frac{g(X_i)}{f(X_i)} = \frac{1}{n} \sum_{i=1}^{n} T_i,$$

where $T_i = g(X_i)/f(X_i)$. In effect, this is what we did in Example 4.5.3 ($f$ is the
Uniform[0, 1] density), in Example 4.5.4 (f is the Exponential(25) density), and in 
Example 4.5.6 (f is the N (0, 1) density). But note that there are many other possible 
choices. In Example 4.5.3, we could have taken f to be any beta density. In Example 
4.5.4, we could have taken f to be any gamma density, and similarly in Example 
4.5.6. Most statistical computer packages have commands for generating from these 
distributions. In a given problem, what is the best one to use? 

In such a case, we would naturally use the algorithm that was most efficient. For
the algorithms we have been discussing here, this means that if, based on a sample
of $n$, algorithm 1 leads to an estimate with standard error $\sigma_1/\sqrt{n}$, and algorithm 2
leads to an estimate with standard error $\sigma_2/\sqrt{n}$, then algorithm 1 is more efficient than
algorithm 2 whenever $\sigma_1 < \sigma_2$. Naturally, we would prefer algorithm 1 because the
intervals (4.4.3) or (4.4.4) will tend to be shorter for algorithm 1 for the same sample 
size. Actually, a more refined comparison of efficiency would also take into account the 
total amount of computer time used by each algorithm, but we will ignore this aspect of 
the problem here. See Problem 4.5.21 for more discussion of efficiency and the choice 
of algorithm in the context of the integration problem. 
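
The comparison of efficiencies can be explored numerically. The sketch below is only an illustration under our own assumptions: it takes the integrand of Example 4.5.3 to be $g(x) = \cos(x^2)\sin(x^4)$ on (0, 1), and compares the Uniform[0, 1] density with one particular beta density, $f(x) = 5x^4$ (the Beta(5, 1) density, simulated by $X = U^{1/5}$). The function names and the seed are hypothetical.

```python
import math
import random


def g(x):
    """Integrand assumed for Example 4.5.3."""
    return math.cos(x ** 2) * math.sin(x ** 4)


def mean_and_se(ts):
    """Return the sample mean and its estimated standard error S/sqrt(n)."""
    n = len(ts)
    m = sum(ts) / n
    s = math.sqrt(sum((t - m) ** 2 for t in ts) / (n - 1))
    return m, s / math.sqrt(n)


def compare_samplers(n, seed=5):
    rng = random.Random(seed)
    # Algorithm 1: f is the Uniform[0, 1] density, so T = g(X) / 1.
    t1 = [g(rng.random()) for _ in range(n)]
    # Algorithm 2: f(x) = 5 x^4 on (0, 1); its cdf is x^5, so X = U^(1/5).
    t2 = []
    for _ in range(n):
        x = (1.0 - rng.random()) ** 0.2    # U in (0, 1] keeps x > 0
        t2.append(g(x) / (5.0 * x ** 4))
    return mean_and_se(t1), mean_and_se(t2)


(m1, se1), (m2, se2) = compare_samplers(50_000)
print("uniform sampler:  ", m1, se1)
print("Beta(5,1) sampler:", m2, se2)
```

Both averages estimate the same integral; the sampler with the smaller standard error is the more efficient one in the sense described above, ignoring computing time.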


Summary of Section 4.5 


• An unknown quantity can be approximately computed using a Monte Carlo approximation,
whereby independent replications of a random experiment (usually
on a computer) are averaged to estimate the quantity.

• Monte Carlo approximations can be used to approximate complicated sums, integrals,
and sampling distributions, all by choosing the random experiment appropriately.


EXERCISES 


4.5.1 Describe a Monte Carlo approximation of SA cos? (x Je 2 dx. 


4.5.2 Describe a Monte Carlo approximation of $\sum_{j=0}^{m} j^6 \binom{m}{j} 2^j 3^{-m}$. (Hint: Remember
the Binomial($m$, 2/3) distribution.)


4.5.3 Describe a Monte Carlo approximation of $\int_0^\infty e^{-5x - 14x^2}\,dx$. (Hint: Remember
the Exponential(5) distribution.)




4.5.4 Suppose $X_1, X_2, \ldots$ are i.i.d. with distribution Poisson($\lambda$), where $\lambda$ is unknown.
Consider $M_n = (X_1 + X_2 + \cdots + X_n)/n$ as an estimate of $\lambda$. Suppose we know that
$\lambda \le 10$. How large must $n$ be to guarantee that $M_n$ will be within 0.1 of the true value
of $\lambda$ with virtual certainty, i.e., when is 3 standard deviations smaller than 0.1?

4.5.5 Describe a Monte Carlo approximation of $\sum_{j=0}^{\infty} \sin(j^2)\, 5^j / j!$. Assume you
have available an algorithm for generating from the Poisson(5) distribution.


4.5.6 Describe a Monte Carlo approximation of Py e dx. (Hint: Remember the 
Uniform[0, 10] distribution.) 

4.5.7 Suppose we repeat a certain experiment 2000 times and obtain a sample average
of $-5$ and a standard error of 17. In terms of this, specify an interval that is virtually
certain to contain the experiment’s (unknown) true mean $\mu$.

4.5.8 Suppose we repeat a certain experiment 400 times and get i.i.d. response values
$X_1, X_2, \ldots, X_{400}$. Suppose we compute that the sample average is $M_{400} = 6$ and
furthermore that $\sum_{i=1}^{400} (X_i)^2 = 15{,}400$. In terms of this:

(a) Compute the standard error opn. 

(b) Specify an interval that is virtually certain to contain the (unknown) true mean $\mu$ of
the $X_i$.

4.5.9 Suppose a certain experiment has probability $\theta$ of success, where $0 < \theta < 1$
but $\theta$ is unknown. Suppose we repeat the experiment 1000 times, of which 400 are
successes and 600 are failures. Compute an interval of values that are virtually certain
to contain $\theta$.

4.5.10 Suppose a certain experiment has probability $\theta$ of success, where $0 < \theta < 1$
but $\theta$ is unknown. Suppose we repeat the experiment $n$ times, and let $Y$ be the fraction
of successes.

(a) In terms of $\theta$, what is Var($Y$)?

(b) For what value of $\theta$ is Var($Y$) the largest?

(c) What is this largest possible value of Var($Y$)?

(d) Compute the smallest integer $n$ such that we can be sure that Var($Y$) < 0.01,
regardless of the value of $\theta$.

4.5.11 Suppose $X$ and $Y$ are random variables with joint density given by $f_{X,Y}(x, y) = C\,g(x, y)$
for $0 \le x, y \le 1$ (with $f_{X,Y}(x, y) = 0$ for other $x, y$), for appropriate constant
$C$, where


g(x,y) = x? y? siny) cos( JF) exp(x? +y). 


(a) Explain why 


$$E(X) = \frac{\int_0^1 \int_0^1 x\, g(x, y)\,dx\,dy}{\int_0^1 \int_0^1 g(x, y)\,dx\,dy}.$$


(b) Describe a Monte Carlo algorithm to approximately compute E (X). 

4.5.12 Let $g(x, y) = \cos(\sqrt{xy})$, and consider the integral $I = \int_0^5 \int_0^4 g(x, y)\,dy\,dx$.
(a) Prove that $I = 20\,E[g(X, Y)]$ where $X \sim$ Uniform[0, 5] and $Y \sim$ Uniform[0, 4].
(b) Use part (a) to describe a Monte Carlo algorithm to approximately compute $I$.


4.5.13 Consider the integral $I = \int_0^1 \int_0^\infty h(x, y)\,dy\,dx$, where


h(æ, y) =e” cos( xy). 


(a) Prove that $I = E[e^{Y} h(X, Y)]$, where $X \sim$ Uniform[0, 1] and $Y \sim$ Exponential(1).
(b) Use part (a) to describe a Monte Carlo algorithm to approximately compute $I$.
(c) If $X \sim$ Uniform[0, 1] and $Y \sim$ Exponential(5), then prove that

$$I = (1/5)\, E[e^{5Y} h(X, Y)].$$

(d) Use part (c) to describe a Monte Carlo algorithm to approximately compute $I$.
(e) Explain how you might use a computer to determine which is better, the algorithm 
in part (b) or the algorithm in part (d). 


COMPUTER EXERCISES 


4.5.14 Use a Monte Carlo algorithm to approximate fo cos(x?) sin(x*) dx based on a 
large sample (take n = 10°, if possible). Assess the error in the approximation. 

4.5.15 Use a Monte Carlo algorithm to approximate io 25 cos(x*)e~*>* dx based on 
a large sample (take n = 10°, if possible). Assess the error in the approximation. 
4.5.16 Use a Monte Carlo algorithm to approximate YoU 2 4 3)7557 based on a 
large sample (take n = 105, if possible). Assess the error in the approximation. 

4.5.17 Suppose $X \sim N(0, 1)$. Use a Monte Carlo algorithm to approximate
$P(X^2 - 3X + 2 > 0)$ based on a large sample (take $n = 10^5$, if possible). Assess the error in
the approximation.


PROBLEMS 


4.5.18 Suppose that $X_1, X_2, \ldots$ are i.i.d. Bernoulli($\theta$) where $\theta$ is unknown. Determine
a lower bound on $n$ so that the probability that the estimate $M_n$ will be within $\delta$ of
the unknown value of $\theta$ is about 0.9974. This allows us to run simulations with high
confidence that the error in the approximation quoted is less than some prescribed value
$\delta$. (Hint: Use the fact that $x(1 - x) \le 1/4$ for all $x \in [0, 1]$.)

4.5.19 Suppose that $X_1, X_2, \ldots$ are i.i.d. with unknown mean $\mu$ and unknown variance
$\sigma^2$. Suppose we know, however, that $\sigma^2 \le \sigma_0^2$, where $\sigma_0^2$ is a known value. Determine
a lower bound on $n$ so that the probability that the estimate $M_n$ will be within $\delta$ of
the unknown value of $\mu$ is about 0.9974. This allows us to run simulations with high
confidence that the error in the approximation quoted is less than some prescribed value
$\delta$.

4.5.20 Suppose $X_1, X_2, \ldots$ are i.i.d. with distribution Uniform[0, $\theta$], where $\theta$ is unknown,
and consider $Z_n = n^{-1}(n + 1) X_{(n)}$ as an estimate of $\theta$ (see Section 2.8.4 on
order statistics).

(a) Prove that $E(Z_n) = \theta$ and compute Var($Z_n$).

(b) Use Chebyshev’s inequality to show that $Z_n$ converges in probability to $\theta$.
(c) Show that $E(2M_n) = \theta$ and compare $M_n$ and $Z_n$ with respect to their efficiencies
as estimators of $\theta$. Which would you use to estimate $\theta$ and why?
4.5.21 (importance sampling) Suppose we want to approximate the integral $\int_a^b g(x)\,dx$,
where we assume this integral is finite. Let $f$ be a density on the interval $(a, b)$ such
that $f(x) > 0$ for every $x \in (a, b)$ and is such that we have a convenient algorithm for
generating $X_1, X_2, \ldots$ i.i.d. with distribution given by $f$.
(a) Prove that

$$M_n(f) = \frac{1}{n} \sum_{i=1}^{n} \frac{g(X_i)}{f(X_i)} \xrightarrow{\text{a.s.}} \int_a^b g(x)\,dx.$$

(We refer to $f$ as an importance sampler and note this shows that every $f$ satisfying
the above conditions provides a consistent estimator $M_n(f)$ of $\int_a^b g(x)\,dx$.)
(b) Prove that

$$\mathrm{Var}(M_n(f)) = \frac{1}{n} \left( \int_a^b \frac{g^2(x)}{f(x)}\,dx - \left( \int_a^b g(x)\,dx \right)^2 \right).$$

(c) Suppose that $g(x) = h(x) f(x)$, where $f$ is as described above. Show that importance
sampling with respect to $f$ leads to the estimator

$$M_n(f) = \frac{1}{n} \sum_{i=1}^{n} h(X_i).$$

(d) Show that if there exists $c$ such that $|g(x)| \le c f(x)$ for all $x \in (a, b)$, then
$\mathrm{Var}(M_n(f)) < \infty$.

(e) Determine the standard error of $M_n(f)$ and indicate how you would use this to
assess the error in the approximation $M_n(f)$ when $\mathrm{Var}(M_n(f)) < \infty$.


COMPUTER PROBLEMS 


4.5.22 Use a Monte Carlo algorithm to approximate P(X? + Y? < 3), where X ~ 
N(1, 2) independently of Y ~ Gamma(1, 1) based on a large sample (take n = 10°, 
if possible). Assess the error in the approximation. How large does n have to be to 
guarantee the estimate is within 0.01 of the true value with virtual certainty? (Hint: 
Problem 4.5.18.) 

4.5.23 Use a Monte Carlo algorithm to approximate E (X? + Y°), where X ~ N(1, 2) 
independently of Y ~ Gamma(1, 1) based on a large sample (take n = 10°, if possi- 
ble). Assess the error in the approximation. 

4.5.24 For the integral of Exercise 4.5.3, compare the efficiencies of the algorithm 
based on generating from an Exponential (5) distribution with that based on generating 
from an N (0, 1/7) distribution. 




CHALLENGES 


4.5.25 (Buffon’s needle) Suppose you drop a needle at random onto a large sheet of 
lined paper. Assume the distance between the lines is exactly equal to the length of the 
needle. 

(a) Prove that the probability that the needle lands touching a line is equal to $2/\pi$.
(Hint: Let $D$ be the distance from the higher end of the needle to the line just below it,
and let $A$ be the angle the needle makes with that line. Then what are the distributions
of $D$ and $A$? Under what conditions on $D$ and $A$ will the needle be touching a line?)
(b) Explain how this experiment could be used to obtain a Monte Carlo approximation
for the value of $\pi$.

4.5.26 (Optimal importance sampling) Consider importance sampling as described in 
Problem 4.5.21. 


(a) Prove that $\mathrm{Var}(M_n(f))$ is minimized by taking

$$f(x) = |g(x)| \Big/ \int_a^b |g(x)|\,dx.$$

Calculate the minimum variance and show that the minimum variance is 0 when $g(x) \ge 0$
for all $x \in (a, b)$.

(b) Why is this optimal importance sampler typically not feasible? (The optimal im- 
portance sampler does indicate, however, that in our search for an efficient importance 
sampler, we look for an f that is large when |g] is large and small when |g| is small.) 


DISCUSSION TOPICS 


4.5.27 An integral like Ai x? cos(x?)e™* dx can be approximately computed using a 
numerical integration computer package (e.g., using Simpson’s rule). What are some 
advantages and disadvantages of using a Monte Carlo approximation instead of a nu- 
merical integration package? 


4.5.28 Carry out the Buffon’s needle Monte Carlo experiment, described in Challenge
4.5.25, by repeating the experiment at least 20 times. Present the estimate of $\pi$ so
obtained. How close is it to the true value of $\pi$? What could be done to make the
estimate more accurate?


4.6 Normal Distribution Theory


Because of the central limit theorem (Theorem 4.4.3), the normal distribution plays 
an extremely important role in statistical theory. For this reason, we shall consider 
a number of important properties and distributions related to the normal distribution. 
These properties and distributions will be very important for the statistical theory in 
later chapters of this book. 

We already know that if $X_1 \sim N(\mu_1, \sigma_1^2)$ independent of $X_2 \sim N(\mu_2, \sigma_2^2)$, then
$cX_1 + d \sim N(c\mu_1 + d, c^2\sigma_1^2)$ (see Exercise 2.6.3) and $X_1 + X_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$
(see Problem 2.9.14). Combining these facts and using induction, we have the
following result.


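Theorem 4.6.1 Suppose $X_i \sim N(\mu_i, \sigma_i^2)$ for $i = 1, 2, \ldots, n$ and that they are
independent random variables. Let $Y = (\sum_i a_i X_i) + b$ for some constants $\{a_i\}$ and
$b$. Then

$$Y \sim N\left( \left( \sum_i a_i \mu_i \right) + b, \; \sum_i a_i^2 \sigma_i^2 \right).$$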


This immediately implies the following. 


Corollary 4.6.1 Suppose $X_i \sim N(\mu, \sigma^2)$ for $i = 1, 2, \ldots, n$ and that they are
independent random variables. If $\bar{X} = (X_1 + \cdots + X_n)/n$, then $\bar{X} \sim N(\mu, \sigma^2/n)$.


A more subtle property of normal distributions is the following. 


Theorem 4.6.2 Suppose $X_i \sim N(\mu_i, \sigma_i^2)$ for $i = 1, 2, \ldots, n$ and also that the $\{X_i\}$
are independent. Let $U = \sum_{i=1}^{n} a_i X_i$ and $V = \sum_{i=1}^{n} b_i X_i$ for some constants $\{a_i\}$
and $\{b_i\}$. Then $\mathrm{Cov}(U, V) = \sum_i a_i b_i \sigma_i^2$. Furthermore, $\mathrm{Cov}(U, V) = 0$ if and only
if $U$ and $V$ are independent.


PROOF The formula for $\mathrm{Cov}(U, V)$ follows immediately from the linearity of covariance
(Theorem 3.3.2) because we have

$$\begin{aligned}
\mathrm{Cov}(U, V) &= \mathrm{Cov}\left( \sum_{i=1}^{n} a_i X_i, \sum_{j=1}^{n} b_j X_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i b_j \,\mathrm{Cov}(X_i, X_j) \\
&= \sum_{i=1}^{n} a_i b_i \,\mathrm{Cov}(X_i, X_i) = \sum_{i=1}^{n} a_i b_i \,\mathrm{Var}(X_i) = \sum_{i=1}^{n} a_i b_i \sigma_i^2
\end{aligned}$$

(note that $\mathrm{Cov}(X_i, X_j) = 0$ for $i \ne j$, by independence). Also, if $U$ and $V$ are
independent, then we must have $\mathrm{Cov}(U, V) = 0$ by Corollary 3.3.2.

It remains to prove that, if $\mathrm{Cov}(U, V) = 0$, then $U$ and $V$ are independent. This
involves a two-dimensional change of variable, as discussed in the advanced Section
2.9.2, so we refer the reader to Section 4.7 for this part of the proof. ■


Theorem 4.6.2 says that, for the special case of linear combinations of independent 
normal distributions, if Cov(U, V) = 0, then U and V are independent. However, it is 
important to remember that this property is not true in general, and there are random 
variables X and Y such that Cov(X, Y) = 0 even though X and Y are not independent 
(see Example 3.3.10). Furthermore, this property is not even true of normal distribu- 
tions in general (see Problem 4.6.13). 
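
As a numerical illustration of the covariance formula in Theorem 4.6.2, the following Python sketch compares a simulated sample covariance of $U$ and $V$ with $\sum_i a_i b_i \sigma_i^2$. It is only a sketch: the function name `cov_check`, the seed, the number of replications, and the particular constants are all our own hypothetical choices, and a sample covariance near 0 is of course not by itself a demonstration of independence.

```python
import random


def cov_check(a, b, mu, sigma, reps=100_000, seed=6):
    """Compare the sample covariance of U = sum a_i X_i and V = sum b_i X_i
    with the formula sum a_i b_i sigma_i^2, for independent X_i ~ N(mu_i, sigma_i^2).
    """
    rng = random.Random(seed)
    us, vs = [], []
    for _ in range(reps):
        xs = [rng.gauss(m, s) for m, s in zip(mu, sigma)]   # gauss takes (mean, sd)
        us.append(sum(ai * x for ai, x in zip(a, xs)))
        vs.append(sum(bi * x for bi, x in zip(b, xs)))
    ubar = sum(us) / reps
    vbar = sum(vs) / reps
    sample_cov = sum((u - ubar) * (v - vbar) for u, v in zip(us, vs)) / (reps - 1)
    formula = sum(ai * bi * s * s for ai, bi, s in zip(a, b, sigma))
    return sample_cov, formula


# Hypothetical constants: here 4*1*1 + 1*(-1)*4 = 0, so U and V are independent.
print(cov_check(a=(4, 1), b=(1, -1), mu=(0, 3), sigma=(1, 2)))
# Hypothetical constants with nonzero covariance: 1*3*1 + 2*1*4 = 11.
print(cov_check(a=(1, 2), b=(3, 1), mu=(0, 3), sigma=(1, 2)))
```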




Note that using linear algebra, we can write the equations $U = \sum_{i=1}^{n} a_i X_i$ and
$V = \sum_{i=1}^{n} b_i X_i$ of Theorem 4.6.2 in matrix form as

$$\begin{pmatrix} U \\ V \end{pmatrix} = A \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}, \tag{4.6.1}$$

where

$$A = \begin{pmatrix} a_1 & a_2 & \cdots & a_n \\ b_1 & b_2 & \cdots & b_n \end{pmatrix}.$$

Furthermore, the rows of $A$ are orthogonal if and only if $\sum_i a_i b_i = 0$. Now, in the case
$\sigma_i = 1$ for all $i$, we have that $\mathrm{Cov}(U, V) = \sum_i a_i b_i$. Hence, if $\sigma_i = 1$ for all $i$, then
Theorem 4.6.2 can be interpreted as saying that if $U$ and $V$ are given by (4.6.1), then
$U$ and $V$ are independent if and only if the rows of $A$ are orthogonal. Linear algebra is
used extensively in more advanced treatments of these ideas.


4.6.1 The Chi-Squared Distribution


We now introduce another distribution, related to the normal distribution. 


Definition 4.6.1 The chi-squared distribution with $n$ degrees of freedom (or chi-squared($n$),
or $\chi^2(n)$) is the distribution of the random variable

$$Z = X_1^2 + X_2^2 + \cdots + X_n^2,$$

where $X_1, \ldots, X_n$ are i.i.d., each with the standard normal distribution $N(0, 1)$.


Most statistical packages have built-in routines for the evaluation of chi-squared prob- 
abilities (also see Table D.3 in Appendix D). 
One property of the chi-squared distribution is easy. 


Theorem 4.6.3 If $Z \sim \chi^2(n)$, then $E(Z) = n$.

PROOF Write $Z = X_1^2 + X_2^2 + \cdots + X_n^2$, where $\{X_i\}$ are i.i.d. $\sim N(0, 1)$. Then
$E((X_i)^2) = 1$. It follows by linearity that $E(Z) = 1 + \cdots + 1 = n$. ■


The density function of the chi-squared distribution is a bit harder to obtain. We 
begin with the case n = 1. 


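Theorem 4.6.4 Let $Z \sim \chi^2(1)$. Then

$$f_Z(z) = \frac{1}{\sqrt{2\pi}} z^{-1/2} e^{-z/2} = \frac{(1/2)^{1/2}}{\Gamma(1/2)} z^{-1/2} e^{-z/2}$$

for $z > 0$, with $f_Z(z) = 0$ for $z < 0$. That is, $Z \sim$ Gamma(1/2, 1/2) (using
$\Gamma(1/2) = \sqrt{\pi}$).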




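PROOF Because $Z \sim \chi^2(1)$, we can write $Z = X^2$, where $X \sim N(0, 1)$. We then
compute that, for $z > 0$,

$$\int_{-\infty}^{z} f_Z(s)\,ds = P(Z \le z) = P(X^2 \le z) = P(-\sqrt{z} \le X \le \sqrt{z}).$$

But because $X \sim N(0, 1)$ with density function $\phi(s) = (2\pi)^{-1/2} e^{-s^2/2}$, we can
rewrite this as

$$\int_{-\infty}^{z} f_Z(s)\,ds = \int_{-\sqrt{z}}^{\sqrt{z}} \phi(s)\,ds = \int_{-\infty}^{\sqrt{z}} \phi(s)\,ds - \int_{-\infty}^{-\sqrt{z}} \phi(s)\,ds.$$

Because this is true for all $z > 0$, we can differentiate with respect to $z$ (using the
fundamental theorem of calculus and the chain rule) to obtain

$$f_Z(z) = \frac{1}{2\sqrt{z}} \phi(\sqrt{z}) - \frac{-1}{2\sqrt{z}} \phi(-\sqrt{z}) = \frac{1}{\sqrt{z}} \phi(\sqrt{z}) = \frac{1}{\sqrt{2\pi z}} e^{-z/2},$$

as claimed. ■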


In Figure 4.6.1, we have plotted the $\chi^2(1)$ density. Note that the density becomes
infinite at 0.

Figure 4.6.1: Plot of the $\chi^2(1)$ density.


Theorem 4.6.5 Let $Z \sim \chi^2(n)$. Then $Z \sim$ Gamma($n/2$, 1/2). That is,

$$f_Z(z) = \frac{1}{2^{n/2}\, \Gamma(n/2)} z^{(n/2) - 1} e^{-z/2}$$

for $z > 0$, with $f_Z(z) = 0$ for $z < 0$.




PROOF Because $Z \sim \chi^2(n)$, we can write $Z = X_1^2 + X_2^2 + \cdots + X_n^2$, where the $X_i$
are i.i.d. $N(0, 1)$. But this means that the $X_i^2$ are i.i.d. $\chi^2(1)$. Hence, by Theorem 4.6.4,
we have $X_i^2$ i.i.d. Gamma(1/2, 1/2) for $i = 1, 2, \ldots, n$. Therefore, $Z$ is the sum of $n$
independent random variables, each having distribution Gamma(1/2, 1/2).

Now by Appendix C (see Problem 3.4.20), the moment-generating function of a
Gamma($\alpha$, $\beta$) random variable is given by $m(s) = \beta^{\alpha} (\beta - s)^{-\alpha}$ for $s < \beta$. Putting
$\alpha = 1/2$ and $\beta = 1/2$, and applying Theorem 3.4.5, the variable $Y = X_1^2 + X_2^2 + \cdots + X_n^2$
has moment-generating function given by

$$m_Y(s) = \prod_{i=1}^{n} m_{X_i^2}(s) = \prod_{i=1}^{n} \left( \frac{1}{2} \right)^{1/2} \left( \frac{1}{2} - s \right)^{-1/2} = \left( \frac{1}{2} \right)^{n/2} \left( \frac{1}{2} - s \right)^{-n/2}$$

for $s < 1/2$. We recognize this as the moment-generating function of the Gamma($n/2$,
1/2) distribution. Therefore, by Theorem 3.4.6, we have that $X_1^2 + X_2^2 + \cdots + X_n^2 \sim$
Gamma($n/2$, 1/2), as claimed.
This result can also be obtained using Problem 2.9.15 and induction. ■
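
Theorems 4.6.3 and 4.6.5 can be checked informally by simulation. The sketch below compares sums of $n$ squared standard normals with direct Gamma($n/2$, 1/2) draws. The function name `chi2_vs_gamma`, the seed, and the sample size are our own assumptions. One point to watch: Python's `random.gammavariate` is parameterized by shape and scale, so the rate 1/2 used in this text corresponds to scale 2.

```python
import random


def chi2_vs_gamma(n, reps=20_000, seed=7):
    """Return ((mean, P(Z <= n)) for sums of n squared N(0,1) variables,
               (mean, P(Z <= n)) for Gamma(n/2, rate 1/2) variables)."""
    rng = random.Random(seed)
    sums = [sum(rng.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(reps)]
    # gammavariate(alpha, beta) uses beta as a SCALE; rate 1/2 means scale 2.
    gammas = [rng.gammavariate(n / 2, 2.0) for _ in range(reps)]

    def summary(zs):
        return sum(zs) / reps, sum(z <= n for z in zs) / reps

    return summary(sums), summary(gammas)


# Hypothetical choice n = 4: both sample means should be close to 4.
print(chi2_vs_gamma(4))
```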


Note that the $\chi^2(2)$ density is the same as the Exponential(1/2) density, because
Gamma(1, 1/2) has density $(1/2)e^{-z/2}$ for $z > 0$. In Figure
4.6.2, we have plotted several $\chi^2$ densities. Observe that the $\chi^2$ densities are asymmetric and
skewed to the right. As the degrees of freedom increase, the central mass of probability
moves to the right.


Figure 4.6.2: Plot of the $\chi^2(3)$ (solid line) and the $\chi^2(7)$ (dashed line) density functions.


One application of the chi-squared distribution is the following. 


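Theorem 4.6.6 Let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$. Put

$$\bar{X} = \frac{1}{n}(X_1 + \cdots + X_n) \quad \text{and} \quad S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.$$

Then $(n - 1) S^2/\sigma^2 \sim \chi^2(n - 1)$, and furthermore, $S^2$ and $\bar{X}$ are independent.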




PROOF See Section 4.7 for the proof of this result. ■

Because the $\chi^2(n - 1)$ distribution has mean $n - 1$, we obtain the following.

Corollary 4.6.2 $E(S^2) = \sigma^2$.

PROOF Theorems 4.6.6 and 4.6.3 imply that $E((n - 1) S^2/\sigma^2) = n - 1$ and that
$E(S^2) = \sigma^2$. ■

Theorem 4.6.6 will find extensive use in Chapter 6. For example, this result, together
with Corollary 4.6.1, gives us the joint sampling distribution of the sample mean
$\bar{X}$ and the sample variance $S^2$ when we are sampling from an $N(\mu, \sigma^2)$ distribution. If
we do not know $\mu$, then $\bar{X}$ is a natural estimator of this quantity and, similarly, $S^2$ is a
natural estimator of $\sigma^2$, when it is unknown. Interestingly, we divide by $n - 1$ in $S^2$,
rather than $n$, precisely because we want $E(S^2) = \sigma^2$ to hold, as in Corollary 4.6.2.
Actually, this property does not depend on sampling from a normal distribution. It can
be shown that anytime $X_1, \ldots, X_n$ is a sample from a distribution with variance $\sigma^2$,
then $E(S^2) = \sigma^2$.
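
The remark that $E(S^2) = \sigma^2$ holds beyond the normal case can be illustrated by simulation. The sketch below is a hypothetical example of our own: it draws samples of size $n = 5$ from the Exponential(1) distribution (which has variance 1) and averages both the divide-by-$(n-1)$ and the divide-by-$n$ versions of the sample variance. The function name `average_variances`, the seed, and the replication count are assumptions.

```python
import random


def average_variances(n=5, reps=40_000, seed=8):
    """Average S^2 (divisor n - 1) and the divisor-n version over many
    samples of size n from Exponential(1), whose true variance is 1."""
    rng = random.Random(seed)
    total_unbiased = 0.0
    total_biased = 0.0
    for _ in range(reps):
        xs = [rng.expovariate(1.0) for _ in range(n)]   # rate parameter 1
        xbar = sum(xs) / n
        ss = sum((x - xbar) ** 2 for x in xs)           # sum of squared deviations
        total_unbiased += ss / (n - 1)
        total_biased += ss / n
    return total_unbiased / reps, total_biased / reps


print(average_variances())
```

With $n = 5$, the divide-by-$n$ version has expectation $(n-1)\sigma^2/n = 0.8$, so it should come out visibly below the divide-by-$(n-1)$ average.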


4.6.2 The t Distribution

The $t$ distribution also has many statistical applications.

Definition 4.6.2 The $t$ distribution with $n$ degrees of freedom (or Student($n$), or
$t(n)$), is the distribution of the random variable

$$Z = \frac{X}{\sqrt{(X_1^2 + X_2^2 + \cdots + X_n^2)/n}},$$

where $X, X_1, \ldots, X_n$ are i.i.d., each with the standard normal distribution $N(0, 1)$.
(Equivalently, $Z = X/\sqrt{Y/n}$, where $Y \sim \chi^2(n)$.)

Most statistical packages have built-in routines for the evaluation of $t(n)$ probabilities
(also see Table D.4 in Appendix D).
The density of the $t(n)$ distribution is given by the following result.


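Theorem 4.6.7 Let $U \sim t(n)$. Then

$$f_U(u) = \frac{\Gamma\left( \frac{n+1}{2} \right)}{\sqrt{\pi}\, \Gamma\left( \frac{n}{2} \right)} \left( 1 + \frac{u^2}{n} \right)^{-(n+1)/2} \frac{1}{\sqrt{n}}$$

for all $u \in R^1$.

PROOF For the proof of this result, see Section 4.7. ■

The following result shows that, when $n$ is large, the $t(n)$ distribution is very similar
to the $N(0, 1)$ distribution.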




Theorem 4.6.8 As $n \to \infty$, the $t(n)$ distribution converges in distribution to a
standard normal distribution.

PROOF Let $Z_1, \ldots, Z_n, Z$ be i.i.d. $N(0, 1)$. As $n \to \infty$, by the strong law of large
numbers, $(Z_1^2 + \cdots + Z_n^2)/n$ converges with probability 1 to the constant 1. Hence, the
distribution of

$$\frac{Z}{\sqrt{(Z_1^2 + \cdots + Z_n^2)/n}} \tag{4.6.2}$$

converges to the distribution of $Z$, which is the standard normal distribution. By Definition
4.6.2, we have that (4.6.2) is distributed $t(n)$. ■

In Figure 4.6.3, we have plotted several $t$ densities. Notice that the densities of the
$t$ distributions are symmetric about 0 and look like the standard normal density.

Figure 4.6.3: Plot of the $t(1)$ (solid line) and the $t(30)$ (dashed line) density functions.

The $t(n)$ distribution has longer tails than the $N(0, 1)$ distribution. For example, the
$t(1)$ distribution (also known as the Cauchy distribution) has 0.9366 of its probability
in the interval $(-10, 10)$, whereas the $N(0, 1)$ distribution has all of its probability
there (at least to four decimal places). The $t(30)$ and the $N(0, 1)$ densities are very
similar.
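
The tail comparison can be reproduced with a short Python sketch. It uses two standard facts that are not derived in this section and are therefore assumptions of the sketch: the Cauchy cdf $F(x) = 1/2 + \arctan(x)/\pi$, and the identity $P(-c < X < c) = \mathrm{erf}(c/\sqrt{2})$ for $X \sim N(0, 1)$. It also simulates $t(n)$ directly from Definition 4.6.2. The function names, seed, and replication count are hypothetical.

```python
import math
import random


def cauchy_interval_prob(c):
    """P(-c < Z < c) for Z ~ t(1), using the Cauchy cdf 1/2 + arctan(x)/pi."""
    return 2.0 * math.atan(c) / math.pi


def normal_interval_prob(c):
    """P(-c < X < c) for X ~ N(0, 1), via the error function."""
    return math.erf(c / math.sqrt(2.0))


def simulated_t_prob(n, c, reps=50_000, seed=9):
    """Estimate P(-c < Z < c) for Z ~ t(n), built as in Definition 4.6.2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = rng.gauss(0, 1)
        y = sum(rng.gauss(0, 1) ** 2 for _ in range(n))   # chi-squared(n)
        if abs(x / math.sqrt(y / n)) < c:
            hits += 1
    return hits / reps


print(cauchy_interval_prob(10), simulated_t_prob(1, 10))
print(normal_interval_prob(10))
```

The exact Cauchy value agrees with the figure quoted above to about three decimal places, and the simulated value should be close to it.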


4.6.3 The F Distribution

Finally, we consider the $F$ distribution.

Definition 4.6.3 The $F$ distribution with $m$ and $n$ degrees of freedom (or $F(m, n)$)
is the distribution of the random variable

$$Z = \frac{(X_1^2 + \cdots + X_m^2)/m}{(Y_1^2 + \cdots + Y_n^2)/n},$$

where $X_1, \ldots, X_m, Y_1, \ldots, Y_n$ are i.i.d., each with the standard normal distribution.
(Equivalently, $Z = (X/m)/(Y/n)$, where $X \sim \chi^2(m)$ and $Y \sim \chi^2(n)$.)

Most statistical packages have built-in routines for the evaluation of $F(m, n)$ probabilities
(also see Table D.5 in Appendix D).
The density of the $F(m, n)$ distribution is given by the following result.


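Theorem 4.6.9 Let $U \sim F(m, n)$. Then

$$f_U(u) = \frac{\Gamma\left( \frac{m+n}{2} \right)}{\Gamma\left( \frac{m}{2} \right) \Gamma\left( \frac{n}{2} \right)} \left( \frac{m}{n} u \right)^{(m/2) - 1} \left( 1 + \frac{m}{n} u \right)^{-(m+n)/2} \frac{m}{n}$$

for $u > 0$, with $f_U(u) = 0$ for $u < 0$.

PROOF For the proof of this result, see Section 4.7. ■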


In Figure 4.6.4, we have plotted several $F(m, n)$ densities. Notice that these densities
are skewed to the right.

Figure 4.6.4: Plot of the $F(2, 1)$ (solid line) and the $F(3, 10)$ (dashed line) density functions.

The following results are useful when it is necessary to carry out computations with
the $F(m, n)$ distribution.

Theorem 4.6.10 If $Z \sim F(m, n)$, then $1/Z \sim F(n, m)$.




PROOF Using Definition 4.6.3, we have

$$\frac{1}{Z} = \frac{(Y_1^2 + \cdots + Y_n^2)/n}{(X_1^2 + \cdots + X_m^2)/m},$$

and the result is immediate from the definition. ■


Therefore, if Z ~ F(m,n), then P(Z < z) = P(1/Z > 1/z) = 1 — P(1/Z < 1/z), 
and P(1/Z < 1/z) is the cdf of the F (n, m) distribution evaluated at 1 /z. 
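
This identity can be checked informally by simulating both sides from Definition 4.6.3. The following Python sketch is only illustrative: the function names `f_variate` and `check_reciprocal`, the seed, and the particular values $m = 3$, $n = 10$, $z = 2$ are our own hypothetical choices.

```python
import random


def f_variate(rng, m, n):
    """Generate one F(m, n) value directly from Definition 4.6.3."""
    num = sum(rng.gauss(0, 1) ** 2 for _ in range(m)) / m
    den = sum(rng.gauss(0, 1) ** 2 for _ in range(n)) / n
    return num / den


def check_reciprocal(m, n, z, reps=20_000, seed=10):
    """Estimate P(Z <= z) for Z ~ F(m, n), and 1 - P(W <= 1/z) for W ~ F(n, m),
    using independent simulations for the two sides."""
    rng = random.Random(seed)
    lhs = sum(f_variate(rng, m, n) <= z for _ in range(reps)) / reps
    rhs = 1.0 - sum(f_variate(rng, n, m) <= 1.0 / z for _ in range(reps)) / reps
    return lhs, rhs


print(check_reciprocal(3, 10, 2.0))
```

The two numbers should agree up to Monte Carlo error, which is how a table or routine for $F(n, m)$ can be used to obtain $F(m, n)$ probabilities.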

In many statistical applications, n can be very large. The following result then gives 
a useful approximation for that case. 


Theorem 4.6.11 If $Z_n \sim F(m, n)$, then $mZ_n$ converges in distribution to a $\chi^2(m)$
distribution as $n \to \infty$.

PROOF Using Definition 4.6.3, we have

$$mZ_n = \frac{X_1^2 + X_2^2 + \cdots + X_m^2}{(Y_1^2 + Y_2^2 + \cdots + Y_n^2)/n}.$$

By Definition 4.6.1, $X_1^2 + \cdots + X_m^2 \sim \chi^2(m)$. By Theorem 4.6.3, $E(Y_i^2) = 1$, so the
strong law of large numbers implies that $(Y_1^2 + Y_2^2 + \cdots + Y_n^2)/n$ converges almost
surely to 1. This establishes the result. ■


Finally, Definitions 4.6.2 and 4.6.3 immediately give the following result. 


Theorem 4.6.12 If $Z \sim t(n)$, then $Z^2 \sim F(1, n)$.


Summary of Section 4.6 


• Linear combinations of independent normal random variables are also normal,
with appropriate mean and variance.

• Two linear combinations of the same collection of independent normal random
variables are independent if and only if their covariance equals 0.

• The chi-squared distribution with $n$ degrees of freedom is the distribution corresponding
to a sum of squares of $n$ i.i.d. standard normal random variables. It has
mean $n$. It is equal to the Gamma($n/2$, 1/2) distribution.

• The $t$ distribution with $n$ degrees of freedom is the distribution corresponding to
a standard normal random variable, divided by the square-root of $1/n$ times an
independent chi-squared random variable with $n$ degrees of freedom. Its density
function was presented. As $n \to \infty$, it converges in distribution to a standard
normal distribution.

• The $F$ distribution with $m$ and $n$ degrees of freedom is the distribution corresponding
to $n/m$ times a chi-squared distribution with $m$ degrees of freedom,
divided by an independent chi-squared distribution with $n$ degrees of freedom.
Its density function was presented. If $Z$ has a $t(n)$ distribution, then $Z^2$ is distributed
$F(1, n)$.


EXERCISES 


4.6.1 Let $X_1 \sim N(3, 2^2)$ and $X_2 \sim N(-8, 5)$ be independent. Let $U = X_1 - 5X_2$
and $V = -6X_1 + CX_2$, where $C$ is a constant.
(a) What are the distributions of $U$ and $V$?
(b) What value of $C$ makes $U$ and $V$ be independent?
4.6.2 Let $X \sim N(3, 5)$ and $Y \sim N(-7, 2)$ be independent.
(a) What is the distribution of $Z = 4X - Y/3$?
(b) What is the covariance of $X$ and $Z$?
4.6.3 Let $X \sim N(3, 5)$ and $Y \sim N(-7, 2)$ be independent. Find values of $C_1 \ne 0$,
$C_2$, $C_3 \ne 0$, $C_4$, $C_5$ so that $C_1(X + C_2)^2 + C_3(Y + C_4)^2 \sim \chi^2(C_5)$.
4.6.4 Let $X \sim \chi^2(n)$ and $Y \sim N(0, 1)$ be independent. Prove that $X + Y^2 \sim \chi^2(n + 1)$.
4.6.5 Let $X \sim \chi^2(n)$ and $Y \sim \chi^2(m)$ be independent. Prove that $X + Y \sim \chi^2(n + m)$.
4.6.6 Let $X_1, X_2, \ldots, X_{4n}$ be i.i.d. with distribution $N(0, 1)$. Find a value of $C$ such
that

$$C\, \frac{X_1^2 + X_2^2 + \cdots + X_n^2}{X_{n+1}^2 + X_{n+2}^2 + \cdots + X_{4n}^2} \sim F(n, 3n).$$
4.6.7 Let X1, X2,..., Xn41 bei.id. with distribution N (0, 1). Find a value of C such 


that 
X 
Ce a; 


2 2 
XB S 


4.6.8 Let X ~ N@,5) and Y ~ N(—7, 2) be independent. Find values of C1, C2, C3, 
C4, Cs, C6 so that 
CX + C2)% 
(Y + C4)?)% 


4.6.9 Let X ~ N@,5) and Y ~ N(-—7, 2) be independent. Find values of C1, C2, C3, 
C4, Cs, C6, C7 so that 


~ t(C6). 


CX + C2)% 


“+O ~ F (Co, C7). 


4.6.10 Let X1, X2,..., X100 be independent, each with the standard normal distribu- 
tion. 


(a) Compute the distribution of X7. 

(b) Compute the distribution of X: 3 +x A 

(c) Compute the distribution of X19/,/[X39 + X3 + X29173. 
(d) Compute the distribution of 3X2, / [X39 + X2 + XGol- 




(e) Compute the distribution of 


30 X24 XZ +--+ X4, 
7 y2 2 DT 
70 Xi +X +--+ + Xio 


4.6.11 Let $X_1, X_2, \ldots, X_{61}$ be independent, each distributed as $N(\mu, \sigma^2)$. Set
$\bar{X} = (1/61)(X_1 + X_2 + \cdots + X_{61})$ and
$$S^2 = \frac{1}{60} \left[ (X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_{61} - \bar{X})^2 \right]$$
as usual.

(a) For what values of $K$ and $m$ is it true that the quantity $Y = K(\bar{X} - \mu)/\sqrt{S^2}$ has a
$t$ distribution with $m$ degrees of freedom?

(b) With $K$ as in part (a), find $y$ such that $P(Y > y) = 0.05$.

(c) For what values of $a$ and $b$ and $c$ is it true that the quantity $W = a(\bar{X} - \mu)^2/S^2$ has
an $F$ distribution with $b$ and $c$ degrees of freedom?

(d) For those values of $a$ and $b$ and $c$, find a quantity $w$ so that $P(W > w) = 0.05$.
4.6.12 Suppose the core temperature (in degrees Celsius, when used intensively) of
the latest Dell desktop computer is normally distributed with mean 40 and standard
deviation 5, while for the latest Compaq it is normally distributed with mean 45 and
standard deviation 8. Suppose we measure the Dell temperature 20 times (on separate
days) and obtain measurements $D_1, D_2, \ldots, D_{20}$, and we also measure the Compaq
temperature 30 times and obtain measurements $C_1, C_2, \ldots, C_{30}$.

(a) Compute the distribution of $\bar{D} = (D_1 + \cdots + D_{20})/20$.

(b) Compute the distribution of $\bar{C} = (C_1 + \cdots + C_{30})/30$.

(c) Compute the distribution of $Z = \bar{C} - \bar{D}$.

(d) Compute $P(\bar{C} < \bar{D})$.

(e) Let $U = (D_1 - \bar{D})^2 + (D_2 - \bar{D})^2 + \cdots + (D_{20} - \bar{D})^2$. What is $P(U > 633.25)$?


PROBLEMS 


4.6.13 Let $X \sim N(0, 1)$, and let $P(Y = 1) = P(Y = -1) = 1/2$. Assume $X$ and $Y$
are independent. Let $Z = XY$.

(a) Prove that Z ~ N(0, 1). 

(b) Prove that Cov(X, Z) = 0. 

(c) Prove directly that X and Z are not independent. 

(d) Why does this not contradict Theorem 4.6.2? 

4.6.14 Let $Z \sim t(n)$. Prove that $P(Z < -x) = P(Z > x)$ for $x \in R^1$, namely, prove
that the $t(n)$ distribution is symmetric about 0.

4.6.15 Let $X_n \sim F(n, 2n)$ for $n = 1, 2, 3, \ldots$. Prove that $X_n \to 1$ in probability and
with probability 1.

4.6.16 (The general chi-squared distribution) Prove that for $\alpha > 0$, the function
$$f(z) = \frac{1}{2^{\alpha/2} \, \Gamma(\alpha/2)} \, z^{(\alpha/2)-1} e^{-z/2}$$
defines a probability distribution on $(0, \infty)$. This distribution is known as the $\chi^2(\alpha)$
distribution, i.e., it generalizes the distribution in Section 4.6.2 by allowing the degrees
of freedom to be an arbitrary positive real number. (Hint: The $\chi^2(\alpha)$ distribution is the
same as a Gamma$(\alpha/2, 1/2)$ distribution.)


4.6.17 (MV) (The general t distribution) Prove that for $\alpha > 0$, the function
$$f(u) = \frac{\Gamma\left(\frac{\alpha+1}{2}\right)}{\sqrt{\pi} \, \Gamma\left(\frac{\alpha}{2}\right)} \left(1 + \frac{u^2}{\alpha}\right)^{-(\alpha+1)/2} \frac{1}{\sqrt{\alpha}}$$
defines a probability distribution on $(-\infty, \infty)$ by showing that the random variable
$$\frac{X}{\sqrt{Y/\alpha}}$$
has this density when $X \sim N(0, 1)$ independent of $Y \sim \chi^2(\alpha)$, as in Problem 4.6.16.
This distribution is known as the $t(\alpha)$ distribution, i.e., it generalizes the distribution
in Section 4.6.3 by allowing the degrees of freedom to be an arbitrary positive real
number. (Hint: The proof is virtually identical to that of Theorem 4.6.7.)


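4.6.18 (MV) (The general F distribution) Prove that for $\alpha > 0$, $\beta > 0$, the function
$$f(u) = \frac{\Gamma\left(\frac{\alpha+\beta}{2}\right)}{\Gamma\left(\frac{\alpha}{2}\right) \Gamma\left(\frac{\beta}{2}\right)} \left(\frac{\alpha}{\beta} u\right)^{(\alpha/2)-1} \left(1 + \frac{\alpha}{\beta} u\right)^{-(\alpha+\beta)/2} \frac{\alpha}{\beta}$$
defines a probability distribution on $(0, \infty)$ by showing that the random variable
$$\frac{X/\alpha}{Y/\beta}$$
has this density whenever $X \sim \chi^2(\alpha)$ independent of $Y \sim \chi^2(\beta)$, as in Problem
4.6.16. This distribution is known as the $F(\alpha, \beta)$ distribution, i.e., it generalizes the
distribution in Section 4.6.4 by allowing the numerator and denominator degrees of
freedom to be arbitrary positive real numbers. (Hint: The proof is virtually identical to
that of Theorem 4.6.9.)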

4.6.19 Prove that when $X \sim t(\alpha)$, as defined in Problem 4.6.17, and $\alpha > 1$, then
$E(X) = 0$. Further prove that when $\alpha > 2$, $\mathrm{Var}(X) = \alpha/(\alpha - 2)$. You can assume
the existence of these integrals (see Challenge 4.6.21). (Hint: To evaluate the second
moment, use $Y = X^2 \sim F(1, \alpha)$ as defined in Problem 4.6.18.)


4.6.20 Prove that when $X \sim F(\alpha, \beta)$, then $E(X) = \beta/(\beta - 2)$ when $\beta > 2$ and
$\mathrm{Var}(X) = 2\beta^2 (\alpha + \beta - 2) / [\alpha (\beta - 2)^2 (\beta - 4)]$ when $\beta > 4$.


CHALLENGES 


4.6.21 Following Problem 4.6.19, prove that the mean of $X$ does not exist whenever
$0 < \alpha \le 1$. Further prove that the variance of $X$ does not exist whenever $0 < \alpha \le 1$
and is infinite when $1 < \alpha \le 2$.


4.6.22 Prove the identity (4.7.1) in Section 4.7, which arises as part of the proof of 
Theorem 4.6.6. 




4.7 Further Proofs (Advanced)
Proof of Theorem 4.3.1 


We want to prove the following result. Let $Z, Z_1, Z_2, \ldots$ be random variables. Suppose
$Z_n \to Z$ with probability 1. Then $Z_n \to Z$ in probability. That is, if a sequence of
random variables converges almost surely, then it converges in probability to the same
limit.

Assume $P(Z_n \to Z) = 1$. Fix $\epsilon > 0$, and let $A_n = \{s : |Z_m(s) - Z(s)| \ge \epsilon$ for some
$m \ge n\}$. Then $\{A_n\}$ is a decreasing sequence of events. Furthermore, if $s \in \bigcap_{n=1}^{\infty} A_n$,
then $Z_n(s) \not\to Z(s)$ as $n \to \infty$. Hence,
$$P\left(\bigcap_{n=1}^{\infty} A_n\right) \le P(Z_n \not\to Z) = 0.$$
By continuity of probabilities, we have $\lim_{n \to \infty} P(A_n) = P\left(\bigcap_{n=1}^{\infty} A_n\right) = 0$. Hence,
$P(|Z_n - Z| \ge \epsilon) \le P(A_n) \to 0$ as $n \to \infty$. Because this is true for any $\epsilon > 0$, we
see that $Z_n \to Z$ in probability. ∎


Proof of Theorem 4.4.1 


We show that if $X_n \xrightarrow{P} X$, then $X_n \xrightarrow{D} X$.

Suppose $X_n \to X$ in probability and that $P(X = x) = 0$. We wish to show that
$\lim_{n \to \infty} P(X_n \le x) = P(X \le x)$.

Choose any $\epsilon > 0$. Now, if $X_n \le x$, then we must have either $X \le x + \epsilon$ or
$|X - X_n| \ge \epsilon$. Hence, by subadditivity,
$$P(X_n \le x) \le P(X \le x + \epsilon) + P(|X - X_n| \ge \epsilon).$$
Interchanging the roles of $X$ and $X_n$ and replacing $x$ by $x - \epsilon$ in this argument, we see also that
$$P(X \le x - \epsilon) \le P(X_n \le x) + P(|X - X_n| \ge \epsilon).$$
Rearranging and combining these two inequalities, we have
$$P(X \le x - \epsilon) - P(|X - X_n| \ge \epsilon) \le P(X_n \le x) \le P(X \le x + \epsilon) + P(|X - X_n| \ge \epsilon).$$
This is the key.
We next let $n \to \infty$. Because $X_n \to X$ in probability, we know that
$$\lim_{n \to \infty} P(|X - X_n| \ge \epsilon) = 0.$$
This means that $\lim_{n \to \infty} P(X_n \le x)$ is "sandwiched" between $P(X \le x - \epsilon)$ and
$P(X \le x + \epsilon)$.
We then let $\epsilon \searrow 0$. By continuity of probabilities,
$$\lim_{\epsilon \searrow 0} P(X \le x + \epsilon) = P(X \le x) \quad \text{and} \quad \lim_{\epsilon \searrow 0} P(X \le x - \epsilon) = P(X < x).$$
This means that $\lim_{n \to \infty} P(X_n \le x)$ is "sandwiched" between $P(X < x)$ and
$P(X \le x)$.

But because $P(X = x) = 0$, we must have $P(X < x) = P(X \le x)$. Hence,
$\lim_{n \to \infty} P(X_n \le x) = P(X \le x)$, as required. ∎




Proof of Theorem 4.4.3 (The central limit theorem) 


We must prove the following. Let $X_1, X_2, \ldots$ be i.i.d. with finite mean $\mu$ and finite
variance $\sigma^2$. Let $Z \sim N(0, 1)$. Set $S_n = X_1 + \cdots + X_n$, and
$$Z_n = \frac{S_n - n\mu}{\sqrt{n}\,\sigma}.$$
Then as $n \to \infty$, the sequence $\{Z_n\}$ converges in distribution to $Z$, i.e., $Z_n \xrightarrow{D} Z$.

Recall that the standard normal distribution has moment-generating function given
by $m_Z(s) = \exp(s^2/2)$.

We shall now assume that $m_{Z_n}(s)$ is finite for $|s| < s_0$ for some $s_0 > 0$. (This
assumption can be eliminated by using characteristic functions instead of
moment-generating functions.) Assuming this, we will prove that for each real number $s$, we
have $\lim_{n \to \infty} m_{Z_n}(s) = m_Z(s)$, where $m_{Z_n}(s)$ is the moment-generating function of
$Z_n$. It then follows from Theorem 4.4.2 that $Z_n$ converges to $Z$ in distribution.

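To proceed, let $Y_i = (X_i - \mu)/\sigma$. Then $E(Y_i) = 0$ and $E(Y_i^2) = \mathrm{Var}(Y_i) = 1$.
Also, we have
$$Z_n = \frac{1}{\sqrt{n}} (Y_1 + \cdots + Y_n).$$
Let $m_Y(s) = E(e^{sY_i})$ be the moment-generating function of $Y_i$ (which is the same for
all $i$, because they are i.i.d.). Then using independence, we compute that
$$\begin{aligned}
\lim_{n \to \infty} m_{Z_n}(s) &= \lim_{n \to \infty} E\left(e^{sZ_n}\right) = \lim_{n \to \infty} E\left(e^{s(Y_1 + \cdots + Y_n)/\sqrt{n}}\right) \\
&= \lim_{n \to \infty} E\left(e^{sY_1/\sqrt{n}} \, e^{sY_2/\sqrt{n}} \cdots e^{sY_n/\sqrt{n}}\right) \\
&= \lim_{n \to \infty} E\left(e^{sY_1/\sqrt{n}}\right) E\left(e^{sY_2/\sqrt{n}}\right) \cdots E\left(e^{sY_n/\sqrt{n}}\right) \\
&= \lim_{n \to \infty} m_Y(s/\sqrt{n}) \, m_Y(s/\sqrt{n}) \cdots m_Y(s/\sqrt{n}) \\
&= \lim_{n \to \infty} m_Y(s/\sqrt{n})^n.
\end{aligned}$$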


Now, we know from Theorem 3.5.3 that $m_Y(0) = E(e^0) = 1$. Also, $m_Y'(0) =
E(Y_i) = 0$ and $m_Y''(0) = E(Y_i^2) = 1$. But then expanding $m_Y(s)$ in a Taylor series
around $s = 0$, we see that
$$m_Y(s) = 1 + 0s + \frac{1}{2!} s^2 + o(s^2) = 1 + s^2/2 + o(s^2),$$
where $o(s^2)$ stands for a quantity that, as $s \to 0$, goes to 0 faster than $s^2$ does;
namely, $o(s^2)/s^2 \to 0$ as $s \to 0$. This means that
$$m_Y(s/\sqrt{n}) = 1 + (s/\sqrt{n})^2/2 + o((s/\sqrt{n})^2) = 1 + s^2/2n + o(1/n),$$
where now $o(1/n)$ stands for a quantity that, as $n \to \infty$, goes to 0 faster than $1/n$
does.




Finally, we recall from calculus that, for any real number $c$, $\lim_{n \to \infty} (1 + c/n)^n =
e^c$. It follows from this and the above that
$$\lim_{n \to \infty} \left(m_Y(s/\sqrt{n})\right)^n = \lim_{n \to \infty} \left(1 + s^2/2n\right)^n = e^{s^2/2}.$$
That is, $\lim_{n \to \infty} m_{Z_n}(s) = e^{s^2/2}$, as claimed. ∎
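
As an informal numerical check of the limit in this proof (not part of the proof itself), the following sketch uses one concrete distribution chosen purely for illustration: $Y_i = \pm 1$ with probability 1/2 each, so that $E(Y_i) = 0$, $\mathrm{Var}(Y_i) = 1$, and $m_Y(s) = \cosh(s)$. The function name `mgf_of_standardized_sum` is hypothetical. The code evaluates $m_Y(s/\sqrt{n})^n$ for increasing $n$ and prints it next to $e^{s^2/2}$.

```python
import math

def mgf_of_standardized_sum(s: float, n: int) -> float:
    """m_{Z_n}(s) = m_Y(s / sqrt(n))**n for Y = +1 or -1 with probability 1/2 each.

    For this Y, m_Y(t) = E(e^{tY}) = cosh(t). The choice of Y is an
    illustrative assumption, not something fixed by the theorem.
    """
    return math.cosh(s / math.sqrt(n)) ** n

s = 1.5
limit = math.exp(s * s / 2)  # m_Z(s) for Z ~ N(0, 1)
for n in (1, 10, 100, 10_000):
    # The middle column should approach the right-hand column as n grows.
    print(n, mgf_of_standardized_sum(s, n), limit)
```

Any other distribution with mean 0, variance 1, and a finite moment-generating function near 0 could be substituted for the $\pm 1$ choice.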


Proof of Theorem 4.6.2 


We prove the following. Suppose $X_i \sim N(\mu_i, \sigma_i^2)$ for $i = 1, 2, \ldots, n$ and also that the
$\{X_i\}$ are independent. Let $U = \sum_{i=1}^n a_i X_i$ and $V = \sum_{i=1}^n b_i X_i$, for some constants
$\{a_i\}$ and $\{b_i\}$. Then $\mathrm{Cov}(U, V) = \sum_i a_i b_i \sigma_i^2$. Furthermore, $\mathrm{Cov}(U, V) = 0$ if and
only if $U$ and $V$ are independent.

It was proved in Section 4.6 that $\mathrm{Cov}(U, V) = \sum_i a_i b_i \sigma_i^2$ and that $\mathrm{Cov}(U, V) = 0$
if $U$ and $V$ are independent. It remains to prove that, if $\mathrm{Cov}(U, V) = 0$, then $U$ and $V$
are independent. For simplicity, we take $n = 2$ and $\mu_1 = \mu_2 = 0$ and $\sigma_1^2 = \sigma_2^2 = 1$;
the general case is similar but messier. We therefore have
$$U = a_1 X_1 + a_2 X_2 \quad \text{and} \quad V = b_1 X_1 + b_2 X_2.$$
The Jacobian derivative of this transformation is
$$J(x_1, x_2) = \det \begin{pmatrix} a_1 & b_1 \\ a_2 & b_2 \end{pmatrix} = a_1 b_2 - b_1 a_2.$$
Inverting the transformation gives
$$X_1 = \frac{b_2 U - a_2 V}{a_1 b_2 - b_1 a_2} \quad \text{and} \quad X_2 = \frac{a_1 V - b_1 U}{a_1 b_2 - b_1 a_2}.$$
Also,
$$f_{X_1, X_2}(x_1, x_2) = \frac{1}{2\pi} e^{-(x_1^2 + x_2^2)/2}.$$
Hence, from the multidimensional change of variable theorem (Theorem 2.9.2), we
have
$$\begin{aligned}
f_{U,V}(u, v) &= f_{X_1, X_2}\left(\frac{b_2 u - a_2 v}{a_1 b_2 - b_1 a_2}, \frac{a_1 v - b_1 u}{a_1 b_2 - b_1 a_2}\right) |J(x_1, x_2)|^{-1} \\
&= \frac{1}{2\pi} \, \frac{\exp\{-((b_2 u - a_2 v)^2 + (a_1 v - b_1 u)^2) / 2(a_1 b_2 - b_1 a_2)^2\}}{|a_1 b_2 - b_1 a_2|}.
\end{aligned}$$
But
$$(b_2 u - a_2 v)^2 + (a_1 v - b_1 u)^2 = (b_1^2 + b_2^2) u^2 + (a_1^2 + a_2^2) v^2 - 2(a_1 b_1 + a_2 b_2) uv$$
and $\mathrm{Cov}(U, V) = a_1 b_1 + a_2 b_2$. Hence, if $\mathrm{Cov}(U, V) = 0$, then
$$(b_2 u - a_2 v)^2 + (a_1 v - b_1 u)^2 = (b_1^2 + b_2^2) u^2 + (a_1^2 + a_2^2) v^2$$
and
$$\begin{aligned}
f_{U,V}(u, v) &= \frac{\exp\{-((b_1^2 + b_2^2) u^2 + (a_1^2 + a_2^2) v^2) / 2(a_1 b_2 - b_1 a_2)^2\}}{2\pi |a_1 b_2 - b_1 a_2|} \\
&= \frac{\exp\{-(b_1^2 + b_2^2) u^2 / 2(a_1 b_2 - b_1 a_2)^2\} \, \exp\{-(a_1^2 + a_2^2) v^2 / 2(a_1 b_2 - b_1 a_2)^2\}}{2\pi |a_1 b_2 - b_1 a_2|}.
\end{aligned}$$
It follows that we can factor $f_{U,V}(u, v)$ as a function of $u$ times a function of $v$. But
this implies (see Problem 2.8.19) that $U$ and $V$ are independent. ∎
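
The factorization can be spot-checked numerically. The sketch below is only an illustration under the same simplifying assumptions as the proof ($n = 2$, means 0, variances 1); the coefficient values and the function names `joint_density` and `normal_density` are hypothetical choices. It evaluates the joint density from the change-of-variable formula and compares it with the product of the $N(0, a_1^2 + a_2^2)$ and $N(0, b_1^2 + b_2^2)$ densities of $U$ and $V$ when $a_1 b_1 + a_2 b_2 = 0$.

```python
import math

def joint_density(u, v, a, b):
    """f_{U,V}(u, v) from the change-of-variable formula, with n = 2,
    means 0 and variances 1, U = a1*X1 + a2*X2, V = b1*X1 + b2*X2."""
    (a1, a2), (b1, b2) = a, b
    det = a1 * b2 - b1 * a2  # Jacobian a1*b2 - b1*a2, assumed nonzero
    quad = (b2 * u - a2 * v) ** 2 + (a1 * v - b1 * u) ** 2
    return math.exp(-quad / (2 * det ** 2)) / (2 * math.pi * abs(det))

def normal_density(x, variance):
    """Density of N(0, variance) at x."""
    return math.exp(-x * x / (2 * variance)) / math.sqrt(2 * math.pi * variance)

a, b = (1.0, 2.0), (2.0, -1.0)   # a1*b1 + a2*b2 = 0, so Cov(U, V) = 0
var_u = a[0] ** 2 + a[1] ** 2    # Var(U) = a1^2 + a2^2
var_v = b[0] ** 2 + b[1] ** 2    # Var(V) = b1^2 + b2^2
for u, v in [(0.0, 0.0), (1.0, -0.5), (2.5, 3.0)]:
    product = normal_density(u, var_u) * normal_density(v, var_v)
    # With zero covariance the two printed densities should agree.
    print(u, v, joint_density(u, v, a, b), product)
```

Replacing `b` by coefficients with $a_1 b_1 + a_2 b_2 \neq 0$ makes the two columns differ, because the cross term in $uv$ no longer vanishes.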


Proof of Theorem 4.6.6 


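We want to prove that when $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ and
$$\bar{X} = \frac{1}{n}(X_1 + \cdots + X_n) \quad \text{and} \quad S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2,$$
then $(n - 1) S^2/\sigma^2 \sim \chi^2(n - 1)$ and, furthermore, that $S^2$ and $\bar{X}$ are independent.

We have
$$\frac{(n-1) S^2}{\sigma^2} = \sum_{i=1}^n \left(\frac{X_i - \bar{X}}{\sigma}\right)^2.$$
We rewrite this expression as (see Challenge 4.6.22)
$$\begin{aligned}
\frac{(n-1) S^2}{\sigma^2} = {} & \left(\frac{X_1 - X_2}{\sigma\sqrt{2}}\right)^2 + \left(\frac{X_1 + X_2 - 2X_3}{\sigma\sqrt{2 \cdot 3}}\right)^2 + \left(\frac{X_1 + X_2 + X_3 - 3X_4}{\sigma\sqrt{3 \cdot 4}}\right)^2 \\
& + \cdots + \left(\frac{X_1 + X_2 + \cdots + X_{n-1} - (n-1) X_n}{\sigma\sqrt{(n-1)n}}\right)^2. \qquad (4.7.1)
\end{aligned}$$
Now, by Theorem 4.6.1, each of the $n - 1$ expressions within brackets in (4.7.1)
has the standard normal distribution. Furthermore, by Theorem 4.6.2, the expressions
within brackets in (4.7.1) are all independent of one another and are also all independent
of $\bar{X}$.

It follows that $(n - 1) S^2/\sigma^2$ is independent of $\bar{X}$. It also follows, by the definition
of the chi-squared distribution, that $(n - 1) S^2/\sigma^2 \sim \chi^2(n - 1)$. ∎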
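
Identity (4.7.1) is purely algebraic, so it can be checked on any list of numbers. The following sketch does this with $\sigma$ taken as 1; the sample, the seed, and the function names `centered_sum_of_squares` and `bracket_terms` are illustrative assumptions, and a numerical check of this kind is of course not a substitute for the proof requested in Challenge 4.6.22.

```python
import math
import random

def centered_sum_of_squares(x):
    """Left side of (4.7.1) times sigma^2: the sum of (x_i - xbar)^2."""
    xbar = sum(x) / len(x)
    return sum((xi - xbar) ** 2 for xi in x)

def bracket_terms(x):
    """The n - 1 bracketed expressions in (4.7.1), with sigma taken as 1:
    (x_1 + ... + x_k - k * x_{k+1}) / sqrt(k * (k + 1)) for k = 1, ..., n - 1."""
    terms = []
    partial = 0.0
    for k in range(1, len(x)):
        partial += x[k - 1]                      # running sum x_1 + ... + x_k
        terms.append((partial - k * x[k]) / math.sqrt(k * (k + 1)))
    return terms

rng = random.Random(0)                           # fixed seed, arbitrary choice
x = [rng.gauss(3.0, 2.0) for _ in range(8)]      # illustrative sample of size 8
lhs = centered_sum_of_squares(x)
rhs = sum(t * t for t in bracket_terms(x))
print(lhs, rhs)                                  # the two values should agree
```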


Proof of Theorem 4.6.7 


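We want to show that when $U \sim t(n)$, then
$$f_U(u) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{\pi} \, \Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{u^2}{n}\right)^{-(n+1)/2} \frac{1}{\sqrt{n}}$$
for all $u \in R^1$.

Because $U \sim t(n)$, we can write $U = X/\sqrt{Y/n}$, where $X$ and $Y$ are independent
with $X \sim N(0, 1)$ and $Y \sim \chi^2(n)$. It follows that $X$ and $Y$ have joint density given by
$$f_{X,Y}(x, y) = \frac{e^{-x^2/2} \, y^{(n/2)-1} e^{-y/2}}{\sqrt{2\pi} \, 2^{n/2} \, \Gamma\left(\frac{n}{2}\right)}$$
when $y > 0$ (with $f_{X,Y}(x, y) = 0$ for $y < 0$).

Let $V = Y$. We shall use the multivariate change of variables formula (Theorem 2.9.2)
to compute the joint density $f_{U,V}(u, v)$ of $U$ and $V$. Because $U = X/\sqrt{Y/n}$
and $V = Y$, it follows that $X = U\sqrt{V/n}$ and $Y = V$. We compute the Jacobian term
as
$$J(x, y) = \det \begin{pmatrix} \frac{\partial u}{\partial x} & \frac{\partial v}{\partial x} \\ \frac{\partial u}{\partial y} & \frac{\partial v}{\partial y} \end{pmatrix} = \det \begin{pmatrix} \frac{1}{\sqrt{y/n}} & 0 \\ \frac{-x\sqrt{n}}{2 y^{3/2}} & 1 \end{pmatrix} = \frac{1}{\sqrt{y/n}}.$$
Hence,
$$\begin{aligned}
f_{U,V}(u, v) &= f_{X,Y}\left(u\sqrt{v/n}, \, v\right) J^{-1}\left(u\sqrt{v/n}, \, v\right) \\
&= \frac{e^{-u^2 v/2n} \, v^{(n/2)-1} e^{-v/2}}{\sqrt{2\pi} \, 2^{n/2} \, \Gamma\left(\frac{n}{2}\right)} \sqrt{\frac{v}{n}} \\
&= \frac{1}{\sqrt{\pi}} \, \frac{1}{\Gamma(n/2)} \, \frac{1}{2^{(n+1)/2}} \, \frac{1}{\sqrt{n}} \, v^{(n+1)/2 - 1} e^{-(v/2)(1 + u^2/n)}
\end{aligned}$$
for $v > 0$ (with $f_{U,V}(u, v) = 0$ for $v < 0$).
Finally, we compute the marginal density of $U$:
$$\begin{aligned}
f_U(u) &= \int_{-\infty}^{\infty} f_{U,V}(u, v) \, dv \\
&= \frac{1}{\sqrt{\pi}} \, \frac{1}{\Gamma(n/2)} \, \frac{1}{2^{(n+1)/2}} \, \frac{1}{\sqrt{n}} \int_0^{\infty} v^{(n+1)/2 - 1} e^{-(v/2)(1 + u^2/n)} \, dv \\
&= \frac{1}{\sqrt{\pi}} \, \frac{1}{\Gamma(n/2)} \, \frac{1}{\sqrt{n}} \left(1 + \frac{u^2}{n}\right)^{-(n+1)/2} \int_0^{\infty} w^{(n+1)/2 - 1} e^{-w} \, dw \\
&= \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{\pi} \, \Gamma(n/2)} \left(1 + \frac{u^2}{n}\right)^{-(n+1)/2} \frac{1}{\sqrt{n}},
\end{aligned}$$
where we have made the substitution $w = (1 + u^2/n) v/2$ to get the third equality and
then used the definition of the gamma function to obtain the result. ∎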
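
As a rough sanity check on the density just derived, the sketch below integrates it numerically and confirms that the total area is close to 1. The integration range, the step count, the choice $n = 5$, and the helper names `t_density` and `trapezoid` are assumptions made for illustration. The last line uses the fact that $t(1)$ is the standard Cauchy distribution, whose density at 0 is $1/\pi$.

```python
import math

def t_density(u: float, n: float) -> float:
    """Density of the t(n) distribution as stated in Theorem 4.6.7."""
    const = math.gamma((n + 1) / 2) / (math.sqrt(math.pi) * math.gamma(n / 2))
    return const * (1 + u * u / n) ** (-(n + 1) / 2) / math.sqrt(n)

def trapezoid(f, a: float, b: float, steps: int) -> float:
    """Composite trapezoid rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps))
    return total * h

n = 5
# The tails beyond +/-200 carry negligible mass for n = 5.
area = trapezoid(lambda u: t_density(u, n), -200.0, 200.0, 200_000)
print(area)                             # should be close to 1
print(t_density(0.0, 1), 1 / math.pi)   # t(1) is the Cauchy density at 0
```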


Proof of Theorem 4.6.9 


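We want to show that when $U \sim F(m, n)$, then
$$f_U(u) = \frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)} \left(\frac{m}{n} u\right)^{(m/2)-1} \left(1 + \frac{m}{n} u\right)^{-(m+n)/2} \frac{m}{n}$$
for $u > 0$, with $f_U(u) = 0$ for $u < 0$.

Because $U \sim F(m, n)$, we can write $U = (X/m)/(Y/n)$, where $X$ and $Y$ are
independent with $X \sim \chi^2(m)$ and $Y \sim \chi^2(n)$. It follows that $X$ and $Y$ have joint
density given by
$$f_{X,Y}(x, y) = \frac{x^{(m/2)-1} e^{-x/2} \, y^{(n/2)-1} e^{-y/2}}{2^{m/2} \, \Gamma\left(\frac{m}{2}\right) 2^{n/2} \, \Gamma\left(\frac{n}{2}\right)}$$
when $x, y > 0$ (with $f_{X,Y}(x, y) = 0$ for $x < 0$ or $y < 0$).

Let $V = Y$, and use the multivariate change of variables formula (Theorem 2.9.2)
to compute the joint density $f_{U,V}(u, v)$ of $U$ and $V$. Because $U = (X/m)/(Y/n)$ and
$V = Y$, it follows that $X = (m/n)UV$ and $Y = V$. We compute the Jacobian term as
$$J(x, y) = \det \begin{pmatrix} \frac{\partial u}{\partial x} & \frac{\partial v}{\partial x} \\ \frac{\partial u}{\partial y} & \frac{\partial v}{\partial y} \end{pmatrix} = \det \begin{pmatrix} \frac{n}{my} & 0 \\ \frac{-nx}{my^2} & 1 \end{pmatrix} = \frac{n}{my}.$$
Hence,
$$\begin{aligned}
f_{U,V}(u, v) &= f_{X,Y}((m/n)uv, \, v) \, J^{-1}((m/n)uv, \, v) \\
&= \frac{\left(\frac{m}{n} uv\right)^{(m/2)-1} e^{-(m/n)(uv/2)} \, v^{(n/2)-1} e^{-(v/2)}}{2^{m/2} \, \Gamma\left(\frac{m}{2}\right) 2^{n/2} \, \Gamma\left(\frac{n}{2}\right)} \, \frac{m}{n} v \\
&= \frac{1}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)} \left(\frac{m}{n} u\right)^{(m/2)-1} \frac{m}{n} \, \frac{1}{2^{(m+n)/2}} \, v^{(m+n)/2 - 1} e^{-(v/2)(1 + mu/n)}
\end{aligned}$$
for $u, v > 0$ (with $f_{U,V}(u, v) = 0$ for $u < 0$ or $v < 0$).
Finally, we compute the marginal density of $U$ as
$$\begin{aligned}
f_U(u) &= \int_{-\infty}^{\infty} f_{U,V}(u, v) \, dv \\
&= \frac{1}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)} \left(\frac{m}{n} u\right)^{(m/2)-1} \frac{m}{n} \, \frac{1}{2^{(m+n)/2}} \int_0^{\infty} v^{(m+n)/2 - 1} e^{-(v/2)(1 + mu/n)} \, dv \\
&= \frac{1}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)} \left(\frac{m}{n} u\right)^{(m/2)-1} \left(1 + \frac{m}{n} u\right)^{-(m+n)/2} \frac{m}{n} \int_0^{\infty} w^{(m+n)/2 - 1} e^{-w} \, dw \\
&= \frac{\Gamma\left(\frac{m+n}{2}\right)}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right)} \left(\frac{m}{n} u\right)^{(m/2)-1} \left(1 + \frac{m}{n} u\right)^{-(m+n)/2} \frac{m}{n},
\end{aligned}$$
where we have used the substitution $w = (1 + mu/n)v/2$ to get the third equality, and
the final result follows from the definition of the gamma function. ∎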
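
One way to cross-check this density against the one in Theorem 4.6.7 is through the fact that the square of a $t(n)$ random variable is distributed $F(1, n)$. If $T \sim t(n)$, then by symmetry $T^2$ has density $f_T(\sqrt{w})/\sqrt{w}$ at $w > 0$, and this should coincide with the $F(1, n)$ density. The sketch below compares the two formulas at a few points; the function names, the value $n = 7$, and the evaluation points are illustrative assumptions.

```python
import math

def f_density(u: float, m: float, n: float) -> float:
    """Density of F(m, n) at u, as stated in Theorem 4.6.9 (0 for u <= 0 here)."""
    if u <= 0:
        return 0.0
    const = math.gamma((m + n) / 2) / (math.gamma(m / 2) * math.gamma(n / 2))
    r = m / n
    return const * (r * u) ** (m / 2 - 1) * (1 + r * u) ** (-(m + n) / 2) * r

def t_density(u: float, n: float) -> float:
    """Density of t(n) at u, as stated in Theorem 4.6.7."""
    const = math.gamma((n + 1) / 2) / (math.sqrt(math.pi) * math.gamma(n / 2))
    return const * (1 + u * u / n) ** (-(n + 1) / 2) / math.sqrt(n)

n = 7
for w in (0.1, 1.0, 4.0):
    # If T ~ t(n), then T^2 has density t_density(sqrt(w)) / sqrt(w) at w > 0.
    from_t = t_density(math.sqrt(w), n) / math.sqrt(w)
    print(w, f_density(w, 1, n), from_t)   # the two densities should agree
```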


Chapter 5 
Statistical Inference 


CHAPTER OUTLINE 


Section 1 Why Do We Need Statistics? 
Section 2 Inference Using a Probability Model 
Section 3 Statistical Models 

Section 4 Data Collection 

Section 5 Some Basic Inferences 


In this chapter, we begin our discussion of statistical inference. Probability theory is 
primarily concerned with calculating various quantities associated with a probability 
model. This requires that we know what the correct probability model is. In applica- 
tions, this is often not the case, and the best we can say is that the correct probability 
measure to use is in a set of possible probability measures. We refer to this collection as 
the statistical model. So, in a sense, our uncertainty has increased; not only do we have 
the uncertainty associated with an outcome or response as described by a probability 
measure, but now we are also uncertain about what the probability measure is. 

Statistical inference is concerned with making statements or inferences about char- 
acteristics of the true underlying probability measure. Of course, these inferences must 
be based on some kind of information; the statistical model makes up part of it. Another 
important part of the information will be given by an observed outcome or response, 
which we refer to as the data. Inferences then take the form of various statements about 
the true underlying probability measure from which the data were obtained. These take 
a variety of forms, which we refer to as types of inferences. 

The role of this chapter is to introduce the basic concepts and ideas of statistical 
inference. The most prominent approaches to inference are discussed in Chapters 6, 
7, and 8. Likelihood methods require the least structure as described in Chapter 6. 
Bayesian methods, discussed in Chapter 7, require some additional ingredients. Infer- 
ence methods based on measures of performance and loss functions are described in 
Chapter 8. 




5.1 Why Do We Need Statistics?


While we will spend much of our time discussing the theory of statistics, we should 
always remember that statistics is an applied subject. By this we mean that ultimately 
statistical theory will be applied to real-world situations to answer questions of practical 
importance. 

What is it that characterizes those contexts in which statistical methods are useful? 
Perhaps the best way to answer this is to consider a practical example where statistical 
methodology plays an important role. 


EXAMPLE 5.1.1 Stanford Heart Transplant Study 

In the paper by Turnbull, Brown, and Hu entitled “Survivorship of Heart Transplant 
Data” (Journal of the American Statistical Association, March 1974, Volume 69, 74–80),
an analysis is conducted to determine whether or not a heart transplant program,
instituted at Stanford University, is in fact producing the intended outcome. In this case, 
the intended outcome is an increased length of life, namely, a patient who receives a 
new heart should live longer than if no new heart was received. 

It is obviously important to ensure that a proposed medical treatment for a disease 
leads to an improvement in the condition. Clearly, we would not want it to lead to a 
deterioration in the condition. Also, if it only produced a small improvement, it may 
not be worth carrying out if it is very expensive or causes additional suffering. 

We can never know whether a particular patient who received a new heart has lived 
longer because of the transplant. So our only hope in determining whether the treat- 
ment is working is to compare the lifelengths of patients who received new hearts with 
the lifelengths of patients who did not. There are many factors that influence a patient’s 
lifelength, many of which will have nothing to do with the condition of the patient’s 
heart. For example, lifestyle and the existence of other pathologies, which will vary 
greatly from patient to patient, will have a great influence. So how can we make this 
comparison? 

One approach to this problem is to imagine that there are probability distributions
that describe the lifelengths of the two groups. Let these be given by the densities $f_T$
and $f_C$, where $T$ denotes transplant and $C$ denotes no transplant. Here we have used
$C$ as our label because this group is serving as a control in the study to provide some
comparison to the treatment (a heart transplant). Then we consider the lifelength of a
patient who received a transplant as a random observation from $f_T$ and the lifelength of
a patient who did not receive a transplant as a random observation from $f_C$. We want
to compare $f_T$ and $f_C$, in some fashion, to determine whether or not the transplant
treatment is working. For example, we might compute the mean lifelength of each
distribution and compare these. If the mean lifelength of $f_T$ is greater than that of $f_C$, then
we can assert that the treatment is working. Of course, we would still have to judge
whether the size of the improvement is enough to warrant the additional expense and
patients’ suffering.

If we could take an arbitrarily large number of observations from $f_T$ and $f_C$, then
we know, from the results in previous chapters, that we could determine these distributions
with a great deal of accuracy. In practice, however, we are restricted to a relatively
small number of observations. For example, in the cited study there were 30 patients
in the control group (those who did not receive a transplant) and 52 patients in the
treatment group (those who did receive a transplant).

[Table 5.1: Survival times (X) in days and status (S) at the end of the study for each
patient (P) in the control group.]

For each control patient, the value of X — the number of days they were alive 
after the date they were determined to be a candidate for a heart transplant until the 
termination date of the study — was recorded. For various reasons, these patients did 
not receive new hearts, e.g., they died before a new heart could be found for them. 
These data, together with an indicator for the status of the patient at the termination 
date of the study, are presented in Table 5.1. The indicator value S = a denotes that the 
patient was alive at the end of the study and S = d denotes that the patient was dead. 

For each treatment patient, the value of Y, the number of days they waited for the 
transplant after the date they were determined to be a candidate for a heart transplant, 
and the value of Z, the number of days they were alive after the date they received 
the heart transplant until the termination date of the study, were both recorded. The 
survival times for the treatment group are then given by the values of Y + Z. These 
data, together with an indicator for the status of the patient at the termination date of 
the study, are presented in Table 5.2. 

We cannot compare $f_T$ and $f_C$ directly because we do not know these distributions.
But we do have some information about them because we have obtained values from
each, as presented in Tables 5.1 and 5.2. So how do we use these data to compare $f_T$
and $f_C$ to answer the question of central importance, concerning whether or not the
treatment is effective? This is the realm of statistics and statistical theory, namely,
providing methods for making inferences about unknown probability distributions based
upon observations (samples) obtained from them.

We note that we have simplified this example somewhat, although our discussion 
presents the essence of the problem. The added complexity comes from the fact that 
typically statisticians will have available additional data on each patient, such as their 
age, gender, and disease history. As a particular example of this, in Table 5.2 we have 
the values of both Y and Z for each patient in the treatment group. As it turns out, 
this additional information, known as covariates, can be used to make our comparisons 
more accurate. This will be discussed in Chapter 10. ∎


[Table 5.2: The number of days until transplant (Y), survival times in days after
transplant (Z), and status (S) at the end of the study for each patient (P) in the
treatment group.]


The previous example provides some evidence that questions of great practical im- 
portance require the use of statistical thinking and methodology. There are many sit- 
uations in the physical and social sciences where statistics plays a key role, and the 
reasons are just like those found in Example 5.1.1. The central ingredient in all of 
these is that we are faced with uncertainty. This uncertainty is caused both by vari- 
ation, which can be modeled via probability, and by the fact that we cannot collect 
enough observations to know the correct probability models precisely. The first four 
chapters have dealt with building, and using, a mathematical model to deal with the 
first source of uncertainty. In this chapter, we begin to discuss methods for dealing 
with the second source of uncertainty. 


Summary of Section 5.1 


• Statistics is applied to situations in which we have questions that cannot be
answered definitively, typically because of variation in data.


• Probability is used to model the variation observed in the data. Statistical inference
is concerned with using the observed data to help identify the true probability
distribution (or distributions) producing this variation and thus gain insight
into the answers to the questions of interest.




EXERCISES 


5.1.1 Compute the mean survival times for the control group and for the treatment 
groups in Example 5.1.1. What do you conclude from these numbers? Do you think 
it is valid to base your conclusions about the effectiveness of the treatment on these 
numbers? Explain why or why not. 

5.1.2 Are there any unusual observations in the data presented in Example 5.1.1? If so, 
what effect do you think these observations have on the mean survival times computed 
in Exercise 5.1.1? 

5.1.3 In Example 5.1.1, we can use the status variable $S$ as a covariate. What is the
practical significance of this variable?

5.1.4 A student is uncertain about the mark that will be received in a statistics course. 
The course instructor has made available a database of marks in the course for a number 
of years. Can you identify a probability distribution that may be relevant to quantifying 
the student’s uncertainty? What covariates might be relevant in this situation? 

5.1.5 The following data were generated from an $N(\mu, 1)$ distribution by a student.
Unfortunately, the student forgot which value of $\mu$ was used, so we are uncertain about
the correct probability distribution to use to describe the variation in the data.


 0.2  -0.7   0.0  -1.9   0.7  -0.3   0.3   0.4
 0.3  -0.8   1.5   0.1   0.3  -0.7  -1.8   0.2


Can you suggest a plausible value for $\mu$? Explain your reasoning.

5.1.6 Suppose you are interested in determining the average age of all male students 
at a particular college. The registrar of the college allows you access to a database 
that lists the age of every student at the college. Describe how you might answer your 
question. Is this a statistical problem in the sense that you are uncertain about anything 
and so will require the use of statistical methodology? 

5.1.7 Suppose you are told that a characteristic $X$ follows an $N(\mu_1, 1)$ distribution and
a characteristic $Y$ follows an $N(\mu_2, 1)$ distribution where $\mu_1$ and $\mu_2$ are unknown. In
addition, you are given the results $x_1, \ldots, x_m$ of $m$ independent measurements on $X$
and $y_1, \ldots, y_n$ of $n$ independent measurements on $Y$. Suggest a method for determining
whether or not $\mu_1$ and $\mu_2$ are equal. Can you think of any problems with your
approach?

5.1.8 Suppose we know that a characteristic $X$ follows an Exponential($\lambda$) distribution
and you are required to determine $\lambda$ based on i.i.d. observations $x_1, \ldots, x_n$ from this
distribution. Suggest a method for doing this. Can you think of any problems with your
approach?


PROBLEMS 


5.1.9 Can you identify any potential problems with the method we have discussed in 
Example 5.1.1 for determining whether or not the heart transplant program is effective 
in extending life? 

5.1.10 Suppose you are able to generate samples of any size from a probability distribution
$P$ for which it is very difficult to compute $P(C)$ for some set $C$. Explain how you
might estimate $P(C)$ based on a sample. What role does the size of the sample play in
your uncertainty about how good your approximation is? Does the size of $P(C)$ play a
role in this?


COMPUTER PROBLEMS 


5.1.11 Suppose we want to obtain the distribution of the quantity $Y = X^4 + 2X^3 - 3$
when $X \sim N(0, 1)$. Here we are faced with a form of mathematical uncertainty because
it is very difficult to determine the distribution of $Y$ using mathematical methods. Propose
a computer method for approximating the distribution function of $Y$ and estimate
$P(Y \in (1, 2))$. What is the relevance of statistical methodology to your approach?


DISCUSSION TOPICS 


5.1.12 Sometimes it is claimed that all uncertainties can and should be modeled using 
probability. Discuss this issue in the context of Example 5.1.1, namely, indicate all the 
things you are uncertain about in this example and how you might propose probability 
distributions to quantify these uncertainties. 


5.2 Inference Using a Probability Model


In the first four chapters, we have discussed probability theory, a good part of which 
has involved the mathematics of probability theory. This tells us how to carry out 
various calculations associated with the application of the theory. It is important to 
keep in mind, however, our reasons for introducing probability in the first place. As 
we discussed in Section 1.1, probability is concerned with measuring or quantifying 
uncertainty. 

Of course, we are uncertain about many things, and we cannot claim that prob- 
ability is applicable to all these situations. Let us assume, however, that we are in 
a situation in which we feel probability is applicable and that we have a probability 
measure P defined on a collection of subsets of a sample space S for a response s. 

In an application of probability, we presume that we know $P$ and are uncertain
about a future, or concealed, response value $s \in S$. In such a context, we may be
required, or may wish, to make an inference about the unknown value of $s$. This can
take the form of a prediction or estimate of a plausible value for $s$, e.g., under suitable
conditions, we might take the expected value of $s$ as our prediction. In other contexts,
we may be asked to construct a subset that has a high probability of containing $s$ and
is in some sense small, e.g., find the region that contains at least 95% of the probability
and has the smallest size amongst all such regions. Alternatively, we might be asked
to assess whether or not a stated value $s_0$ is an implausible value from the known $P$,
e.g., assess whether or not $s_0$ lies in a region assigned low probability by $P$ and so
is implausible. These are examples of inferences that are relevant to applications of
probability theory.


EXAMPLE 5.2.1 
As a specific application, consider the lifelength $X$ in years of a machine where it is
known that $X \sim$ Exponential(1) (see Figure 5.2.1).


Figure 5.2.1: Plot of the Exponential(1) density f. 


Then for a new machine, we might predict its lifelength by $E(X) = 1$ year. Furthermore,
from the graph of the Exponential(1) density, it is clear that the smallest interval
containing 95% of the probability for $X$ is $(0, c)$, where $c$ satisfies
$$0.95 = \int_0^c e^{-x} \, dx = 1 - e^{-c},$$
or $c = -\ln(0.05) = 2.9957$. This interval gives us a reasonable range of probable
lifelengths for the new machine. Finally, if we wanted to assess whether or not $x_0 = 5$
is a plausible lifelength for a newly purchased machine, we might compute the tail
probability as
$$P(X > 5) = \int_5^{\infty} e^{-x} \, dx = e^{-5} = 0.0067,$$
which, in this case, is very small and therefore indicates that $x_0 = 5$ is fairly far out in
the tail. The right tail of this density is a region of low probability for this distribution,
so $x_0 = 5$ can be considered implausible. It is thus unlikely that a machine will last 5
years, so a purchaser would have to plan to replace the machine before that period is
over. ∎
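
The three numbers in this example are easy to reproduce. The short sketch below recomputes them from the closed forms $E(X) = 1$, $c = -\ln(0.05)$, and $P(X > 5) = e^{-5}$; the variable names are illustrative.

```python
import math

# X ~ Exponential(1): density e^{-x} and P(X > x) = e^{-x} for x >= 0.
prediction = 1.0                      # E(X) = 1
c = -math.log(0.05)                   # solves 0.95 = 1 - e^{-c}
tail_at_5 = math.exp(-5)              # P(X > 5)

print(f"predicted lifelength: {prediction}")
print(f"shortest 95% interval: (0, {c:.4f})")
print(f"P(X > 5) = {tail_at_5:.4f}")
```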


In some applications, we receive some partial information about the unknown $s$
taking the form $s \in C \subset S$. In such a case, we replace $P$ by the conditional probability
measure $P(\cdot \mid C)$ when deriving our inferences. Our reasons for doing this are many,
and, in general, we can say that most statisticians agree that it is the right thing to do. It 
is important to recognize, however, that this step does not proceed from a mathematical 
theorem; rather it can be regarded as a basic axiom or principle of inference. We will 
refer to this as the principle of conditional probability, which will play a key role in 
some later developments. 


EXAMPLE 5.2.2 
Suppose we have a machine whose lifelength is distributed as in Example 5.2.1, and 


260 Section 5.2: Inference Using a Probability Model 


the machine has already been running for one year. Then inferences about the lifelength 
of the machine are based on the conditional distribution, given that X > 1. The density 
of this conditional distribution is given by e~°~) for x > 1. The predicted lifelength 
is now 


o0 = o0 
E(X|X > 1) a xe7®D dx = axe FD] +f e7®-D dy = 2. 
1 1 


The fact that the additional lifelength is the same as the predicted lifelength before the 
machine starts working is a special characteristic of the Exponential distribution. This 
will not be true in general (see Exercise 5.2.4). 

The tail probability measuring the plausibility of the value $x_0 = 5$ is given by

$$P(X > 5 \mid X > 1) = \int_5^\infty e^{-(x-1)}\,dx = e^{-4} = 0.0183,$$

which indicates that $x_0 = 5$ is a little more plausible in light of the fact that the machine
has already survived one year. The shortest interval containing 0.95 of the conditional
probability is now of the form $(1, c)$, where $c$ is the solution to

$$0.95 = \int_1^c e^{-(x-1)}\,dx = e\left(e^{-1} - e^{-c}\right),$$

which implies that $c = -\ln\left(e^{-1} - 0.95 e^{-1}\right) = 3.9957$. ■
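
The three quantities in Example 5.2.2 can be checked numerically. The following is a minimal Python sketch, assuming the Exponential(1) lifelength of Example 5.2.1; the names `shift`, `cond_density`, and `cond_mean_numeric`, as well as the midpoint-rule integration, are illustrative choices and not part of the text.

```python
import math

shift = 1.0  # the machine has already survived one year

def cond_density(x):
    """Conditional density of X given X > 1 when X ~ Exponential(1)."""
    return math.exp(-(x - shift)) if x > shift else 0.0

def cond_mean_numeric(upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of x * cond_density(x) over (1, upper)."""
    h = (upper - shift) / steps
    total = 0.0
    for k in range(steps):
        x = shift + (k + 0.5) * h
        total += x * cond_density(x)
    return total * h

# Tail probability P(X > 5 | X > 1) = e^{-(5 - 1)}.
tail_prob = math.exp(-(5.0 - shift))

# Right endpoint c of the interval (1, c): solve 1 - e^{-(c - 1)} = 0.95 for c.
c = shift - math.log(1.0 - 0.95)

print(round(cond_mean_numeric(), 4), round(tail_prob, 4), round(c, 4))
```

The truncation point `upper=60.0` is an assumption of the sketch; the neglected tail of the integral is of order $e^{-59}$ and does not affect the first several decimal places.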


Our main point in this section is simply that we are already somewhat familiar with 
inferential concepts. Furthermore, via the principle of conditional probability, we have 
a basic rule or axiom governing how we go about making inferences in the context 
where the probability measure P is known and s is not known. 


Summary of Section 5.2 


• Probability models are used to model uncertainty about future responses.


• We can use the probability distribution to predict a future response or assess
whether or not a given value makes sense as a possible future value from the 
distribution. 


EXERCISES 


5.2.1 Sometimes the mode of a density (the point where the density takes its maximum 
value) is chosen as a predictor for a future value of a response. Determine this predictor 
in Examples 5.2.1 and 5.2.2 and comment on its suitability as a predictor. 

5.2.2 Suppose it has been decided to use the mean of a distribution to predict a future 
response. In Example 5.2.1, compute the mean-squared error (expected value of the 
square of the error between a future value and its predictor) of this predictor, prior to 
observing the value. To what characteristic of the distribution of the lifelength does 
this correspond? 




5.2.3 Graph the density of the distribution obtained as a mixture of a normal distribution
with mean 4 and variance 1 and a normal distribution with mean −4 and variance
1, where the mixture probability is 0.5. Explain why neither the mean nor the mode is
a suitable predictor in this case. (Hint: Section 2.5.4.)

5.2.4 Repeat the calculations of Examples 5.2.1 and 5.2.2 when the lifelength of a 
machine is known to be distributed as Y = 10X, where X ~ Uniform[0, 1]. 

5.2.5 Suppose that X ~ N(10, 2). What value would you record as a prediction of a 
future value of X? How would you justify your choice? 

5.2.6 Suppose that X ~ N(10, 2). Record the smallest interval containing 0.95 of the 
probability for a future response. (Hint: Consider a plot of the density.) 

5.2.7 Suppose that X ~ Gamma(3, 6). What value would you record as a prediction 
of a future value of X? How would you justify your choice? 

5.2.8 Suppose that X ~ Poisson(5). What value would you record as a prediction of a 
future value of X? How would you justify your choice? 

5.2.9 Suppose that X ~ Geometric(1/3). What value would you record as a prediction 
of a future value of X? 

5.2.10 Suppose that X follows the following probability distribution. 


(a) Record a prediction of a future value of X. 


(b) Suppose you are then told that X > 2. Record a prediction of a future value of X 
that uses this information. 


PROBLEMS 


5.2.11 Suppose a fair coin is tossed 10 times and the response X measured is the 
number of times we observe a head. 

(a) If you use the expected value of the response as a predictor, then what is the predic- 
tion of a future response X? 

(b) Using Table D.6 (or a statistical package), compute a shortest interval containing 
at least 0.95 of the probability for X. Note that it might help to plot the probability 
function of X first. 

(c) What region would you use to assess whether or not a value $x_0$ is a possible future
value? (Hint: What are the regions of low probability for the distribution?) Assess 
whether or not x = 8 is plausible. 

5.2.12 In Example 5.2.1, explain (intuitively) why the interval (0, 2.9957) is the short- 
est interval containing 0.95 of the probability for the lifelength. 

5.2.13 (Problem 5.2.11 continued) Suppose we are told that the number of heads ob- 
served is an even number. Repeat parts (a), (b), and (c). 

5.2.14 Suppose that a response X is distributed Beta(a, b) with a, b > 1 fixed (see
Problem 2.4.16). Determine the mean and the mode (point where density takes its
maximum) of this distribution and assess which is the most accurate predictor of a
future X when using mean-squared error, i.e., the expected squared distance between
X and the prediction.

5.2.15 Suppose that a response X is distributed N (0, 1) and that we have decided to 
predict a future value using the mean of the distribution. 

(a) Determine the prediction for a future X. 

(b) Determine the prediction for a future $Y = X^2$.

(c) Comment on the relationship (or lack thereof) between the answers in parts (a) and 
(b). 

5.2.16 Suppose that X ~ Geometric(1/3). Determine the shortest interval containing 
0.95 of the probability for a future X. (Hint: Plot the probability function and record 
the distribution function.) 

5.2.17 Suppose that X ~ Geometric(1/3) and we are told that X > 5. What value 
would you record as a prediction of a future value of X? Determine the shortest interval 
containing 0.95 of the probability for a future X. (Hint: Plot the probability function 
and record the distribution function.) 


DISCUSSION TOPICS 


5.2.18 Do you think it is realistic for a practitioner to proceed as if he knows the true 
probability distribution for a response in a problem? 


5.3 | Statistical Models 


In a statistical problem, we are faced with uncertainty of a different character than 
that arising in Section 5.2. In a statistical context, we observe the data s, but we are 
uncertain about P. In such a situation, we want to construct inferences about P based 
on s. This is the inverse of the situation discussed in Section 5.2. 

How we should go about making these statistical inferences is probably not at all 
obvious. In fact, there are several possible approaches that we will discuss in subse- 
quent chapters. In this chapter, we will develop the basic ingredients of all the ap- 
proaches. 

Common to virtually all approaches to statistical inference is the concept of the
statistical model for the data $s$. This takes the form of a set $\{P_\theta : \theta \in \Omega\}$ of probability
measures, one of which corresponds to the true unknown probability measure $P$ that
produced the data $s$. In other words, we are asserting that there is a random mechanism
generating $s$, and we know that the corresponding probability measure $P$ is one of the
probability measures in $\{P_\theta : \theta \in \Omega\}$.

The statistical model $\{P_\theta : \theta \in \Omega\}$ corresponds to the information a statistician
brings to the application about what the true probability measure is, or at least what
one is willing to assume about it. The variable $\theta$ is called the parameter of the model,
and the set $\Omega$ is called the parameter space. Typically, we use models where $\theta \in \Omega$
indexes the probability measures in the model, i.e., $P_{\theta_1} = P_{\theta_2}$ if and only if $\theta_1 = \theta_2$.
If the probability measures $P_\theta$ can all be presented via probability functions or
density functions $f_\theta$ (for convenience we will not distinguish between the discrete and
continuous case in the notation), then it is common to write the statistical model as
$\{f_\theta : \theta \in \Omega\}$.

From the definition of a statistical model, we see that there is a unique value $\theta \in \Omega$
such that $P_\theta$ is the true probability measure. We refer to this value as the true
parameter value. It is obviously equivalent to talk about making inferences about the
true parameter value rather than the true probability measure, i.e., an inference about
the true value of $\theta$ is at once an inference about the true probability distribution. So, for
example, we may wish to estimate the true value of $\theta$, construct small regions in $\Omega$ that
are likely to contain the true value, or assess whether or not the data are in agreement
with some particular value $\theta_0$, suggested as being the true value. These are types of
inferences, just like those we discussed in Section 5.2, but the situation here is quite
different.


EXAMPLE 5.3.1 

Suppose we have an urn containing 100 chips, each colored either black B or white W. 
Suppose further that we are told there are either 50 or 60 black chips in the urn. The 
chips are thoroughly mixed, and then two chips are withdrawn without replacement. 
The goal is to make an inference about the true number of black chips in the urn, 
having observed the data $s = (s_1, s_2)$, where $s_i$ is the color of the $i$th chip drawn.

In this case, we can take the statistical model to be $\{P_\theta : \theta \in \Omega\}$, where $\theta$ is
the number of black chips in the urn, so that $\Omega = \{50, 60\}$, and $P_\theta$ is the probability
measure on

$$S = \{(B, B), (B, W), (W, B), (W, W)\}$$

corresponding to $\theta$. Therefore, $P_{50}$ assigns the probability $50 \cdot 49/(100 \cdot 99)$ to each
of the sequences $(B, B)$ and $(W, W)$ and the probability $50 \cdot 50/(100 \cdot 99)$ to each of
the sequences $(B, W)$ and $(W, B)$, and $P_{60}$ assigns the probability $60 \cdot 59/(100 \cdot 99)$
to the sequence $(B, B)$, the probability $40 \cdot 39/(100 \cdot 99)$ to the sequence $(W, W)$, and
the probability $60 \cdot 40/(100 \cdot 99)$ to each of the sequences $(B, W)$ and $(W, B)$.

The choice of the parameter is somewhat arbitrary, as we could have easily labelled
the possible probability measures as $P_1$ and $P_2$, respectively. The parameter is
in essence only a label that allows us to distinguish amongst the possible candidates for
the true probability measure. It is typical, however, to choose this label conveniently
so that it means something in the problem under discussion. ■
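
The two probability measures in Example 5.3.1 can be tabulated directly. The following is a minimal Python sketch; the function name `urn_model` and the dictionary layout are illustrative assumptions, and exact fractions are used so that each measure can be seen to sum to 1.

```python
from fractions import Fraction

def urn_model(theta, total=100):
    """Probabilities of the four ordered outcomes when theta of `total` chips are black
    and two chips are drawn without replacement."""
    black, white = theta, total - theta
    denom = total * (total - 1)
    return {
        ("B", "B"): Fraction(black * (black - 1), denom),
        ("B", "W"): Fraction(black * white, denom),
        ("W", "B"): Fraction(white * black, denom),
        ("W", "W"): Fraction(white * (white - 1), denom),
    }

# The statistical model: one probability measure for each theta in {50, 60}.
model = {theta: urn_model(theta) for theta in (50, 60)}

for theta, probs in model.items():
    print(theta, {s: str(p) for s, p in probs.items()}, "total =", sum(probs.values()))
```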


We note some additional terminology in common usage. If a single observed value
for a response $X$ has the statistical model $\{f_\theta : \theta \in \Omega\}$, then a sample $(X_1, \ldots, X_n)$
(recall that sample here means that the $X_i$ are independent and identically distributed;
see Definition 2.8.6) has joint density given by $f_\theta(x_1) f_\theta(x_2) \cdots f_\theta(x_n)$ for some
$\theta \in \Omega$. This specifies the statistical model for the response $(X_1, \ldots, X_n)$. We refer to
this as the statistical model for a sample. Of course, the true value of $\theta$ for the statistical
model for a sample is the same as that for a single observation. Sometimes, rather than
referring to the statistical model for a sample, we speak of a sample from the statistical
model $\{f_\theta : \theta \in \Omega\}$.

Note that, wherever possible, we will use uppercase letters to denote an unobserved
value of a random variable $X$ and lowercase letters to denote the observed value. So an
observed sample $(X_1, \ldots, X_n)$ will be denoted $(x_1, \ldots, x_n)$.




EXAMPLE 5.3.2 

Suppose there are two manufacturing plants for machines. It is known that machines 
built by the first plant have lifelengths distributed Exponential(1), while machines man- 
ufactured by the second plant have lifelengths distributed Exponential(2/3). The den- 
sities of these distributions are depicted in Figure 5.3.1. 



Figure 5.3.1: Plot of the Exponential(1) (solid line) and Exponential(2/3) (dashed line)
densities.


You have purchased five of these machines knowing that all five came from the 
same plant, but you do not know which plant. Subsequently, you observe the lifelengths 
of these machines, obtaining the sample $(x_1, \ldots, x_5)$, and want to make inferences
about the true $P$.

In this case, the statistical model for a single observation comprises two probability
measures $\{P_1, P_2\}$, where $P_1$ is the Exponential(1) probability measure and $P_2$ is the
Exponential(2/3) probability measure. Here we take the parameter to be $\theta \in \Omega = \{1, 2\}$.

Clearly, longer observed lifelengths favor $\theta = 2$, because the Exponential(2/3)
distribution has the larger mean (1.5 versus 1). For example, if

$$(x_1, \ldots, x_5) = (5.0, 3.5, 3.3, 4.1, 2.8),$$

then intuitively we are more certain that $\theta = 2$ than if

$$(x_1, \ldots, x_5) = (2.0, 2.5, 3.0, 3.1, 1.8).$$


The subject of statistical inference is concerned with making statements like this more 
precise and quantifying our uncertainty concerning the validity of such assertions. 

We note again that the quantity $\theta$ serves only as a label for the distributions in the
model. The value of $\theta$ has no interpretation other than as a label and we could just
as easily have used different values for the labels. In many applications, however, the
parameter $\theta$ is taken to be some characteristic of the distribution that takes a unique
value for each distribution in the model. Here, we could have taken $\theta$ to be the mean
and then the parameter space would be $\Omega = \{1, 1.5\}$. Notice that we could just as well
have used the first quartile, or for that matter any other quantile, to have labelled the
distributions, provided that each distribution in the family yields a unique value for the
characteristic chosen. Generally, any 1-1 transformation of a parameter is acceptable
as a parameterization of a statistical model. When we relabel, we refer to this as a
reparameterization of the statistical model. ■
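
One informal way to make the comparison of the two samples in Example 5.3.2 concrete, which the text has not developed at this point, is to evaluate the joint density of each observed sample under both distributions in the model. The following minimal Python sketch does this, assuming the Exponential(λ) density is $\lambda e^{-\lambda x}$ for $x > 0$; the names `joint_density`, `rates`, and `ratios` are illustrative.

```python
import math

def joint_density(sample, rate):
    """Joint density of an i.i.d. Exponential(rate) sample:
    the product of rate * exp(-rate * x) over the observations."""
    return math.prod(rate * math.exp(-rate * x) for x in sample)

# theta -> rate of the Exponential distribution for that plant
rates = {1: 1.0, 2: 2.0 / 3.0}

samples = {
    "longer": (5.0, 3.5, 3.3, 4.1, 2.8),
    "shorter": (2.0, 2.5, 3.0, 3.1, 1.8),
}

# Ratio of the joint density under theta = 2 to that under theta = 1.
ratios = {
    name: joint_density(xs, rates[2]) / joint_density(xs, rates[1])
    for name, xs in samples.items()
}
print(ratios)
```

A larger ratio means the observed sample has relatively higher joint density under $\theta = 2$. The ratio is used here only as an illustration of "more certain"; the formal approaches to inference are the subject of later chapters.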


We now consider two important examples of statistical models. These are important 
because they commonly arise in applications. 


EXAMPLE 5.3.3 Bernoulli Model 
Suppose that $(x_1, \ldots, x_n)$ is a sample from a Bernoulli($\theta$) distribution with $\theta \in [0, 1]$
unknown. We could be observing the results of tossing a coin and recording $X_i$ equal
to 1 whenever a head is observed on the $i$th toss and equal to 0 otherwise. Alternatively,
we could be observing items produced in an industrial process and recording $X_i$ equal
to 1 whenever the $i$th item is defective and 0 otherwise. In a biomedical application,
the response $X_i = 1$ might indicate that a treatment on a patient has been successful,
whereas $X_i = 0$ indicates a failure. In all these cases, we want to know the true value
of $\theta$, as this tells us something important about the coin we are tossing, the industrial
process, or the medical treatment, respectively.

Now suppose we have no information whatsoever about the true probability $\theta$. Accordingly,
we take the parameter space to be $\Omega = [0, 1]$, the set of all possible values
for $\theta$. The probability function for the $i$th sample item is given by


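$$f_\theta(x_i) = \theta^{x_i}(1 - \theta)^{1 - x_i},$$


and the probability function for the sample is given by

$$\prod_{i=1}^n f_\theta(x_i) = \prod_{i=1}^n \theta^{x_i}(1 - \theta)^{1 - x_i} = \theta^{n\bar{x}}(1 - \theta)^{n(1 - \bar{x})}.$$


This specifies the model for a sample.
Note that we could parameterize this model by any 1-1 function of $\theta$. For example,
$\alpha = \theta^2$ would work (as it is 1-1 on $\Omega$), as would $\psi = \ln\{\theta/(1 - \theta)\}$. ■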
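
The closed form of the Bernoulli sample probability function, and the 1-1 nature of the log-odds reparameterization, can be checked numerically. The following is a minimal Python sketch; the data `xs`, the value `theta`, and all function names are hypothetical choices made for illustration.

```python
import math

def bernoulli_pmf(x, theta):
    """f_theta(x) = theta^x * (1 - theta)^(1 - x) for x in {0, 1}."""
    return theta ** x * (1.0 - theta) ** (1 - x)

def sample_pmf_product(xs, theta):
    """Probability function of the sample as a product over the observations."""
    return math.prod(bernoulli_pmf(x, theta) for x in xs)

def sample_pmf_closed(xs, theta):
    """Closed form theta^(n * xbar) * (1 - theta)^(n * (1 - xbar))."""
    n = len(xs)
    xbar = sum(xs) / n
    return theta ** (n * xbar) * (1.0 - theta) ** (n * (1.0 - xbar))

def log_odds(theta):
    """psi = ln(theta / (1 - theta)); defined only for 0 < theta < 1."""
    return math.log(theta / (1.0 - theta))

def inverse_log_odds(psi):
    """Recovers theta from psi, which shows the map is 1-1 on (0, 1)."""
    return 1.0 / (1.0 + math.exp(-psi))

xs = (1, 0, 0, 1, 1, 0, 1)  # hypothetical observed sample
theta = 0.3                 # hypothetical parameter value
print(sample_pmf_product(xs, theta), sample_pmf_closed(xs, theta))
print(inverse_log_odds(log_odds(theta)))
```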


EXAMPLE 5.3.4 Location-Scale Normal Model 
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution with $\theta = (\mu, \sigma^2) \in R^1 \times R^+$
unknown, where $R^+ = (0, \infty)$. For example, we may have observations of
heights in centimeters of individuals in a population and feel that it is reasonable to 
assume that the distribution of heights in the population is normal with some unknown 
mean and standard deviation. 

The density for the sample is then given by 


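$$\prod_{i=1}^n f_{(\mu,\sigma^2)}(x_i) = \left(2\pi\sigma^2\right)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right) = \left(2\pi\sigma^2\right)^{-n/2} \exp\left(-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2 - \frac{n-1}{2\sigma^2}s^2\right)$$

because (Problem 5.3.13)

$$\sum_{i=1}^n (x_i - \mu)^2 = n(\bar{x} - \mu)^2 + \sum_{i=1}^n (x_i - \bar{x})^2, \qquad (5.3.1)$$

where

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$$

is the sample mean, and

$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$$

is the sample variance.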

Alternative parameterizations for this model are commonly used. For example,
rather than using $(\mu, \sigma^2)$, sometimes $(\mu, \sigma^{-2})$ or $(\mu, \sigma)$ or $(\mu, \ln\sigma)$ is a convenient
choice. Note that $\ln\sigma$ ranges in $R^1$ as $\sigma$ varies in $R^+$. ■
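
The fact that the normal sample density depends on the data only through the sample mean and the sample variance can be checked numerically. The following minimal Python sketch evaluates the density both as a product of $N(\mu, \sigma^2)$ densities and through the summary form; the heights, the parameter values, and the function names are hypothetical and chosen only for illustration.

```python
import math

def density_product(xs, mu, sigma2):
    """Sample density as a product of N(mu, sigma2) densities."""
    return math.prod(
        math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)
        for x in xs
    )

def density_summary(xs, mu, sigma2):
    """Same density written in terms of the sample mean and sample variance."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    exponent = -n * (xbar - mu) ** 2 / (2.0 * sigma2) - (n - 1) * s2 / (2.0 * sigma2)
    return (2.0 * math.pi * sigma2) ** (-n / 2.0) * math.exp(exponent)

def identity_sides(xs, mu):
    """Left and right sides of the sum-of-squares identity (5.3.1)."""
    n = len(xs)
    xbar = sum(xs) / n
    left = sum((x - mu) ** 2 for x in xs)
    right = n * (xbar - mu) ** 2 + sum((x - xbar) ** 2 for x in xs)
    return left, right

heights = (172.0, 165.5, 180.2, 158.9, 169.4)  # hypothetical heights in cm
mu, sigma2 = 170.0, 60.0                       # hypothetical parameter values
print(density_product(heights, mu, sigma2), density_summary(heights, mu, sigma2))
print(identity_sides(heights, mu))
```

A numerical agreement for one data set is of course not a proof; establishing the identity in general is the content of Problem 5.3.13.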


Actually, we might wonder how appropriate the model of Example 5.3.4 is for the 
distribution of heights in a population, for in any finite population the true distribution 
is discrete (there are only finitely many students). Of course, a normal distribution 
may provide a good approximation to a discrete distribution, as in Example 4.4.9. So, 
in Example 5.3.4, we are also assuming that a continuous probability distribution can 
provide a close approximation to the true discrete distribution. As it turns out, such 
approximations can lead to great simplifications in the derivation of inferences, so we 
use them whenever feasible. Such an approximation is, of course, not applicable in 
Example 5.3.3. 

Also note that heights will always be expressed in some specific unit, e.g., centimeters;
based on this, we know that the population mean must be in a certain range of
values, e.g., $\mu \in (0, 300)$, but the statistical model allows for any value for $\mu$. So we
often do have additional information about the true value of the parameter for a model,
but it is somewhat imprecise, e.g., we also probably have $\mu \in (100, 300)$. In Chapter
7, we will discuss ways of incorporating such information into our analysis.

Where does the model information $\{P_\theta : \theta \in \Omega\}$ come from in an application? For
example, how could we know that heights are approximately normally distributed in 
Example 5.3.4? Sometimes there is such information based upon previous experience 
with related applications, but often it is an assumption that requires checking before 
inference procedures can be used. Procedures designed to check such assumptions are 
referred to as model-checking procedures, which will be discussed in Chapter 9. In
practice, model checking is required, because inferences drawn from the
data and the statistical model can be erroneous if the model is wrong.


Summary of Section 5.3 


• In a statistical application, we do not know the distribution of a response, but we
know (or are willing to assume) that the true probability distribution is one of a
set of possible distributions $\{f_\theta : \theta \in \Omega\}$, where $f_\theta$ is the density or probability
function (whichever is relevant) for the response. The set of possible distributions
is called the statistical model.


• The set $\Omega$ is called the parameter space, and the variable $\theta$ is called the parameter
of the model. Because each value of $\theta$ corresponds to a distinct probability
distribution in the model, we can talk about the true value of $\theta$, as this gives the
true distribution via $f_\theta$.


EXERCISES 


5.3.1 Suppose there are three coins — one is known to be fair, one has probability 1/3 
of yielding a head on a single toss, and one has probability 2/3 for head on a single toss. 
A coin is selected (not randomly) and then tossed five times. The goal is to make an 
inference about which of the coins is being tossed, based on the sample. Fully describe 
a statistical model for a single response and for the sample. 

5.3.2 Suppose that one face of a symmetrical six-sided die is duplicated but we do not 
know which one. We do know that if 1 is duplicated, then 2 does not appear; otherwise, 
1 does not appear. Describe the statistical model for a single roll. 

5.3.3 Suppose we have two populations (I and II) and that variable X is known to be 
distributed N (10, 2) on population I and distributed N (8,3) on population II. A sam- 
ple (X1,..., Xn) is generated from one of the populations; you are not told which 
population the sample came from, but you are required to draw inferences about the 
true distribution based on the sample. Describe the statistical model for this problem. 
Could you parameterize this model by the population mean, by the population vari- 
ance? Sometimes problems like this are called classification problems because making 
inferences about the true distribution is equivalent to classifying the sample as belong- 
ing to one of the populations. 

5.3.4 Suppose the situation is as described in Exercise 5.3.3, but now the distribution 
for population I is N(10, 2) and the distribution for population II is N(10, 3). Could 
you parameterize the model by the population mean? By the population variance? 
Justify your answer. 

5.3.5 Suppose that a manufacturing process produces batteries whose lifelengths are 
known to be exponentially distributed but with the mean of the distribution completely 
unknown. Describe the statistical model for a single observation. Is it possible to 
parameterize this model by the mean? Is it possible to parameterize this model by the 
variance? Is it possible to parameterize this model by the coefficient of variation (the 
coefficient of variation of a distribution equals the standard deviation divided by the 
mean)? 

5.3.6 Suppose it is known that a response $X$ is distributed Uniform$[0, \beta]$, where $\beta > 0$
is unknown. Is it possible to parameterize this model by the first quartile of the
distribution? (The first quartile of the distribution of a random variable X is the point c 
satisfying P(X < c) = 0.25.) Explain why or why not. 

5.3.7 Suppose it is known that a random variable $X$ follows one of the following distributions.

θ     P_θ(X = 1)     P_θ(X = 2)     P_θ(X = 3)
A     1/2            1/2            0
B     0              1/2            1/2


(a) What is the parameter space $\Omega$?


(b) Suppose we observe a value X = 1. What is the true value of the parameter? What 
is the true distribution of X? 

(c) What could you say about the true value of the parameter if you had observed 
X =2? 

5.3.8 Suppose we have a statistical model $\{P_1, P_2\}$, where $P_1$ and $P_2$ are probability
measures on a sample space $S$. Further suppose there is a subset $C \subset S$ such that
$P_1(C) = 1$ while $P_2(C^c) = 1$. Discuss how you would make an inference about the
true distribution of a response $s$ after you have observed a single observation.

5.3.9 Suppose you know that the probability distribution of a variable $X$ is either $P_1$
or $P_2$. If you observe $X = 1$ and $P_1(X = 1) = 0.75$ while $P_2(X = 1) = 0.001$,
then what would you guess as the true distribution of X? Give your reasoning for this 
conclusion. 

5.3.10 Suppose you are told that class #1 has 35 males and 65 females while class #2 
has 45 males and 55 females. You are told that a particular student from one of these 
classes is female, but you are not told which class she came from. 

(a) Construct a statistical model for this problem, identifying the parameter, the para- 
meter space, and the family of distributions. Also identify the data. 

(b) Based on the data, do you think a reliable inference is feasible about the true para- 
meter value? Explain why or why not. 

(c) If you had to make a guess about which distribution the data came from, what choice 
would you make? Explain why. 


PROBLEMS 


5.3.11 Suppose in Example 5.3.3 we parameterize the model by $\psi = \ln\{\theta/(1 - \theta)\}$.
Record the statistical model using this parameterization, i.e., record the probability
function using $\psi$ as the parameter and record the relevant parameter space.

5.3.12 Suppose in Example 5.3.4 we parameterize the model by $(\mu, \ln\sigma) = (\mu, \psi)$.
Record the statistical model using this parameterization, i.e., record the density function
using $(\mu, \psi)$ as the parameter and record the relevant parameter space.

5.3.13 Establish the identity (5.3.1).

5.3.14 A sample $(X_1, \ldots, X_n)$ is generated from a Bernoulli($\theta$) distribution with $\theta \in [0, 1]$
unknown, but only $T = \sum_{i=1}^n X_i$ is observed by the statistician. Describe the
statistical model for the observed data.

5.3.15 Suppose it is known that a response $X$ is distributed $N(\mu, \sigma^2)$, where
$\theta = (\mu, \sigma^2) \in R^1 \times R^+$ and $\theta$ is completely unknown. Show how to calculate the first
quartile of each distribution in this model from the values $(\mu, \sigma^2)$. Is it possible to
parameterize the model by the first quartile? Explain your answer.




5.3.16 Suppose response $X$ is known to be distributed $N(Y, \sigma^2)$, where $Y \sim N(0, \delta^2)$
and $\sigma^2, \delta^2 > 0$ are completely unknown. Describe the statistical model for an observation
$(X, Y)$. If $Y$ is not observed, describe the statistical model for $X$.

5.3.17 Suppose we have a statistical model $\{P_1, P_2\}$, where $P_1$ is an $N(10, 1)$ distribution
while $P_2$ is an $N(0, 1)$ distribution.

(a) Is it possible to make any kind of reliable inference about the true distribution based
on a single observation? Why or why not?

(b) Repeat part (a) but now suppose that $P_1$ is an $N(1, 1)$ distribution.

5.3.18 Suppose we have a statistical model $\{P_1, P_2\}$, where $P_1$ is an $N(1, 1)$ distribution
while $P_2$ is an $N(0, 1)$ distribution. Further suppose that we had a sample
$x_1, \ldots, x_{100}$ from the true distribution. Discuss how you might go about making an
inference about the true distribution based on the sample.


DISCUSSION TOPICS 


5.3.19 Explain why you think it is important that statisticians state very clearly what 
they are assuming any time they carry out a statistical analysis. 

5.3.20 Consider the statistical model given by the collection of $N(\mu, \sigma^2)$ distributions
where $\mu \in R^1$ is considered completely unknown, but $\sigma^2$ is assumed known. Do you
think this is a reasonable model to use in an application? Give your reasons why or 
why not. 


5.4 | Data Collection 


The developments of Sections 5.2 and 5.3 are based on the observed response s being 
a realization from a probability measure P. In fact, in many applications, this is an 
assumption. We are often presented with data that could have been produced in this 
way, but we cannot always be sure. 

When we cannot be sure that the data were produced by a random mechanism, then 
the statistical analysis of the data is known as an observational study. In an observa- 
tional study, the statistician merely observes the data rather than intervening directly 
in generating the data, to ensure that the randomness assumption holds. For example, 
suppose a professor collects data from his students for a study that examines the rela- 
tionship between grades and part-time employment. Is it reasonable to regard the data 
collected as having come from a probability distribution? If so, how would we justify 
this? 

It is important for a statistician to distinguish carefully between situations that are 
observational studies and those that are not. As the following discussion illustrates, 
there are qualifications that must be applied to the analysis of an observational study. 
While statistical analyses of observational studies are valid and indeed important, we 
must be aware of their limitations when interpreting their results. 




5.4.1 | Finite Populations 


Suppose we have a finite set $\Pi$ of objects, called the population, and a real-valued
function $X$ (sometimes called a measurement) defined on $\Pi$. So for each $\pi \in \Pi$, we
have a real-valued quantity $X(\pi)$ that measures some aspect or feature of $\pi$.

For example, $\Pi$ could be the set of all students currently enrolled full-time at a
particular university, with $X(\pi)$ the height of student $\pi$ in centimeters. Or, for the
same $\Pi$, we could take $X(\pi)$ to be the gender of student $\pi$, where $X(\pi) = 1$ denotes
female and $X(\pi) = 2$ denotes male. Here, height is a quantitative variable, because its
values mean something on a numerical scale, and we can perform arithmetic on these 
values, e.g., calculate a mean. On the other hand, gender is an example of a categorical 
variable because its values serve only to classify, and any other choice of unique real 
numbers would have served as well as the ones we chose. The first step in a statistical 
analysis is to determine the types of variables we are working with because the relevant 
statistical analysis techniques depend on this. 

The population and the measurement together produce a population distribution
over the population. This is specified by the population cumulative distribution function
$F_X : R^1 \to [0, 1]$, where

$$F_X(x) = \frac{|\{\pi : X(\pi) \le x\}|}{N},$$

with $|A|$ being the number of elements in the set $A$, and $N = |\Pi|$. Therefore, $F_X(x)$
is the proportion of elements in $\Pi$ with their measurement less than or equal to $x$.
Consider the following simple example where we can calculate $F_X$ exactly.


EXAMPLE 5.4.1

Suppose that $\Pi$ is a population of $N = 20$ plots of land of the same size. Further
suppose that $X(\pi)$ is a measure of the fertility of plot $\pi$ on a 10-point scale and that
the following measurements were obtained.

4 8 6 7 8 3 7 5 4 6
9 5 7 5 8 3 4 7 8 3

Then we have

$$F_X(x) = \begin{cases}
0 & x < 3 \\
3/20 & 3 \le x < 4 \\
6/20 & 4 \le x < 5 \\
9/20 & 5 \le x < 6 \\
11/20 & 6 \le x < 7 \\
15/20 & 7 \le x < 8 \\
19/20 & 8 \le x < 9 \\
1 & 9 \le x
\end{cases}$$

because, for example, 6 out of the 20 plots have fertility measurements less than or
equal to 4. ■
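
The population distribution function of Example 5.4.1 can be computed directly from the 20 measurements. The following is a minimal Python sketch; the names `population_cdf` and `cdf_table` are illustrative, and exact fractions are used so the values can be compared with those listed above.

```python
from collections import Counter
from fractions import Fraction

fertility = [4, 8, 6, 7, 8, 3, 7, 5, 4, 6,
             9, 5, 7, 5, 8, 3, 4, 7, 8, 3]

def population_cdf(values, x):
    """Proportion of population elements whose measurement is <= x."""
    return Fraction(sum(1 for v in values if v <= x), len(values))

# Tabulate the cdf at its jump points (the distinct measurement values).
counts = Counter(fertility)
cumulative = 0
cdf_table = {}
for value in sorted(counts):
    cumulative += counts[value]
    cdf_table[value] = Fraction(cumulative, len(fertility))

for value, prob in cdf_table.items():
    print(value, prob)
```

Note that `Fraction` prints values in lowest terms, so 6/20 appears as 3/10.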


The goal of a statistician in this context is to know the function $F_X$ as precisely
as possible. If we know $F_X$ exactly, then we have identified the distribution of $X$
over $\Pi$. One way of knowing the distribution exactly is to conduct a census, wherein
the statistician goes out and observes $X(\pi)$ for every $\pi \in \Pi$ and then calculates $F_X$.
Sometimes this is feasible, but often it is not possible or even desirable, due to the costs
involved in the accurate accumulation of all the measurements; think of how difficult
it might be to collect the heights of all the students at your school.

While sometimes a census is necessary, even mandated by law, often a very accurate
approximation to $F_X$ can be obtained by selecting a subset

$$\{\pi_1, \ldots, \pi_n\} \subset \Pi$$

for some $n < N$. We then approximate $F_X(x)$ by the empirical distribution function
defined by

$$\hat{F}_X(x) = \frac{|\{\pi_i : X(\pi_i) \le x,\ i = 1, \ldots, n\}|}{n} = \frac{1}{n}\sum_{i=1}^n I_{(-\infty, x]}(X(\pi_i)).$$
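
The empirical distribution function is simply an average of indicator values over the selected elements. The following minimal Python sketch computes it for a subset of the plots of Example 5.4.1; the particular subset `selected` is hypothetical and was chosen by hand purely for illustration, and the function names are assumptions of the sketch.

```python
fertility = [4, 8, 6, 7, 8, 3, 7, 5, 4, 6,
             9, 5, 7, 5, 8, 3, 4, 7, 8, 3]

def indicator(x, value):
    """Indicator of the interval (-inf, x]: 1 if value <= x, else 0."""
    return 1 if value <= x else 0

def empirical_cdf(sample_values, x):
    """(1/n) times the sum of the indicators over the selected elements."""
    return sum(indicator(x, v) for v in sample_values) / len(sample_values)

# Hypothetical subset of n = 5 plots, identified by position in the list.
selected = [0, 3, 7, 10, 15]
sample_values = [fertility[i] for i in selected]

print(sample_values)
print(empirical_cdf(sample_values, 5), empirical_cdf(fertility, 5))
```

Applying `empirical_cdf` to the full list reproduces the population distribution function, since then every element of the population has been selected.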


We could also measure more than one aspect of $\pi$ to produce a multivariate measurement
$X : \Pi \to R^k$ for some $k$. For example, if $\Pi$ is again the population of
students, we might have $X(\pi) = (X_1(\pi), X_2(\pi))$, where $X_1(\pi)$ is the height in centimeters
of student $\pi$ and $X_2(\pi)$ is the weight of student $\pi$ in kilograms. We will discuss
multivariate measurements in Chapter 10, where our concern is the relationships
amongst variables, but we focus on univariate measurements here.

There are two questions we need to answer now: how should we select
the subset $\{\pi_1, \ldots, \pi_n\}$, and how large should $n$ be?


5.4.2 | Simple Random Sampling 


We will first address the issue of selecting $\{\pi_1, \ldots, \pi_n\}$. Suppose we select this subset
according to some given rule based on the unique label that each $\pi \in \Pi$ possesses.
For example, if the label is a number, we might order the numbers and then take the $n$
elements with the smallest labels. Or we could order the numbers and take every other
element until we have a subset of $n$, etc.

There are many such rules we could apply, and there is a basic problem with all
of them. If we want $\hat{F}_X$ to approximate $F_X$ for the full population, then, when we
employ a rule, we run the risk of only selecting $\{\pi_1, \ldots, \pi_n\}$ from a subpopulation.
For example, if we use student numbers to identify each element of a population of
students, and more senior students have lower student numbers, then, when $n$ is much
smaller than $N$ and we select the students with smallest student numbers, $\hat{F}_X$ is really
only approximating the distribution of $X$ in the population of senior students at best.
This distribution could be very different from $F_X$. Similarly, for any other rule we
employ, even if we cannot imagine what the subpopulation could be, there may be
such a selection effect, or bias, induced that renders the estimate invalid.
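
The selection effect described above can be illustrated with a small constructed population. In the following minimal Python sketch, everything about the population is a hypothetical assumption: labels play the role of student numbers, lower labels belong to more senior students, and the measurement (number of completed courses) decreases with the label.

```python
def ecdf(values, x):
    """Proportion of the given values that are <= x."""
    return sum(1 for v in values if v <= x) / len(values)

# Hypothetical population of N = 1000 students, listed in order of student number.
# The measurement decreases from 40 (most senior) to 1 (most junior).
N = 1000
completed_courses = [40 - (label * 40) // N for label in range(N)]

# Rule-based selection: take the n students with the smallest student numbers.
n = 50
rule_sample = completed_courses[:n]

x = 20
population_value = ecdf(completed_courses, x)  # value of the population cdf at x
rule_value = ecdf(rule_sample, x)              # estimate based on the rule
print(population_value, rule_value)
```

In this constructed case half of the population has a measurement of at most 20, yet the rule-based subset contains no such student, so the estimate describes only the most senior subpopulation.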

This is the qualification we need to apply when analyzing the results of observa- 
tional studies. In an observational study, the data are generated by some rule, typically 


272 Section 5.4: Data Collection 


unknown to the statistician; this means that any conclusions drawn based on the data 
X(m1),,-.-,X (an) may not be valid for the full population. 

There seems to be only one way to guarantee that selection effects are avoided,
namely, the set $\{\pi_1, \ldots, \pi_n\}$ must be selected using randomness. For simple random
sampling, this means that a random mechanism is used to select the $\pi_i$ in such a way
that each subset of n has probability $1/\binom{N}{n}$ of being chosen. For example, we might
place N chips in a bowl, each with a unique label corresponding to a population element, and then randomly draw n chips from the bowl without replacement. The labels
on the drawn chips identify the individuals that have been selected from $\Pi$. Alternatively, for the randomization, we might use a table of random numbers, such as Table
D.1 in Appendix D (see Table D.1 for a description of how it is used) or generate
random values using a computer algorithm (see Section 2.10).
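
The chip-drawing scheme can be imitated on a computer. The following is a minimal sketch in Python, assuming the population elements carry the labels 1, ..., N; the function name `simple_random_sample` and the sizes N = 20, n = 5 are illustrative choices, not part of the text. It relies on the standard-library call `random.sample`, which draws without replacement.

```python
import random
from math import comb

def simple_random_sample(labels, n, seed=None):
    """Draw n distinct labels without replacement (simple random sampling)."""
    rng = random.Random(seed)          # the seed only makes the illustration repeatable
    return rng.sample(list(labels), n)

N, n = 20, 5                           # hypothetical population and sample sizes
sample = simple_random_sample(range(1, N + 1), n, seed=1)
print(sample)                          # labels of the selected population elements
print(1 / comb(N, n))                  # probability of any one particular subset of size n
```

Every subset of size n is equally likely under this scheme, which is the defining property of simple random sampling described above.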

Note that with simple random sampling $(X(\pi_1), \ldots, X(\pi_n))$ is random. In particular, when n = 1, we then have

$$P(X(\pi_1) \le x) = F_X(x),$$

namely, the probability distribution of the random variable $X(\pi_1)$ is the same as the
population distribution.


EXAMPLE 5.4.2 
Consider the context of Example 5.4.1. When we randomly select the first plot from
$\Pi$, it is clear that each plot has probability 1/20 of being selected. Then we have

$$P(X(\pi_1) \le x) = \frac{|\{\pi : X(\pi) \le x\}|}{20} = F_X(x)$$

for every $x \in R^1$. ∎


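Prior to observing the sample, we also have $P(X(\pi_2) \le x) = F_X(x)$. Consider,
however, the distribution of $X(\pi_2)$ given that $X(\pi_1) = x_1$. Because we have removed
one population member, with measurement value $x_1$, then, when $x \ge x_1$, $N F_X(x) - 1$ is the number
of individuals left in $\Pi$ with $X(\pi) \le x$, while for $x < x_1$ this number remains $N F_X(x)$. Therefore,

$$P(X(\pi_2) \le x \mid X(\pi_1) = x_1) = \begin{cases} \dfrac{N F_X(x) - 1}{N - 1} & x \ge x_1 \\[1ex] \dfrac{N F_X(x)}{N - 1} & x < x_1. \end{cases}$$

Note that this is not equal to $F_X(x)$.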
So with simple random sampling, $X(\pi_1)$ and $X(\pi_2)$ are not independent. Observe,
however, that when N is large, then

$$P(X(\pi_2) \le x \mid X(\pi_1) = x_1) \approx F_X(x),$$

so that $X(\pi_1)$ and $X(\pi_2)$ are approximately independent and identically distributed
(i.i.d.). Similar calculations lead to the conclusion that, when N is large and n is small
relative to N, then with simple random sampling from the population, the random
variables

$$X(\pi_1), \ldots, X(\pi_n)$$


Chapter 5: Statistical Inference 273 


are approximately i.i.d. and with distribution given by $F_X$. So we will treat the observed
values $(x_1, \ldots, x_n)$ of $(X(\pi_1), \ldots, X(\pi_n))$ as a sample (in the sense of Definition
2.8.6) from $F_X$. In this text, unless we indicate otherwise, we will always assume that
n is small relative to N so that this approximation makes sense.
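
To see numerically how the dependence fades as N grows, the conditional probability for the second draw can be computed exactly. This is only a sketch: the population of repeated measurements 0, 1, ..., 9 and the function names are hypothetical, and `x1` is assumed to be a value that actually occurs in the population.

```python
from fractions import Fraction

def population_cdf(values, x):
    """F_X(x): proportion of population measurements that are <= x."""
    return Fraction(sum(v <= x for v in values), len(values))

def conditional_cdf(values, x, x1):
    """P(X(pi_2) <= x | X(pi_1) = x1) after one element with value x1 is removed."""
    count = sum(v <= x for v in values)
    if x >= x1:
        count -= 1                     # the removed element was among those counted
    return Fraction(count, len(values) - 1)

for N in (10, 10_000):
    values = [i % 10 for i in range(N)]   # hypothetical measurements 0, 1, ..., 9
    gap = abs(conditional_cdf(values, 4, 2) - population_cdf(values, 4))
    print(N, float(gap))                  # the gap shrinks as N grows
```

Exact fractions are used so that the small differences are not blurred by rounding.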

Under the i.i.d. assumption, the weak law of large numbers (Theorem 4.2.1) implies
that the empirical distribution function $\hat{F}_X$ satisfies

$$\hat{F}_X(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(X(\pi_i)) \xrightarrow{P} F_X(x)$$

as $n \to \infty$. So we see that $\hat{F}_X$ can be considered as an estimate of the population
cumulative distribution function (cdf) $F_X$.
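
The convergence can be observed in a small simulation. The sketch below assumes i.i.d. draws from a Uniform(0, 1) distribution, for which $F_X(x) = x$, in place of sampling from a large finite population; the function name `ecdf`, the seed, and the evaluation point 0.3 are illustrative.

```python
import random

def ecdf(sample, x):
    """Empirical distribution function: proportion of sample values <= x."""
    return sum(xi <= x for xi in sample) / len(sample)

rng = random.Random(0)                 # fixed seed, for repeatability only
x = 0.3
for n in (10, 100, 10_000):
    sample = [rng.random() for _ in range(n)]   # i.i.d. Uniform(0, 1), so F(x) = x
    print(n, abs(ecdf(sample, x) - x))          # the error tends to shrink as n grows
```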

Whenever the data have been collected using simple random sampling, we will refer to the statistical investigation as a sampling study. It is a basic principle of good
statistical practice that sampling studies are always preferred over observational studies, whenever they are feasible. This is because we can be sure that, with a sampling
study, any conclusions we draw based on the sample $\pi_1, \ldots, \pi_n$ will apply to the population $\Pi$ of interest. With observational studies, we can never be sure that the sample
data have not actually been selected from some proper subset of $\Pi$. For example, if you
were asked to make inferences about the distribution of student heights at your school
but selected some of your friends as your sample, then it is clear that the estimated cdf
may be very unlike the true cdf (possibly more of your friends are of one gender than
the other).

Often, however, we have no choice but to use observational data for a statistical 
analysis. Sampling directly from the population of interest may be extremely difficult 
or even impossible. We can still treat the results of such analyses as a form of evidence, 
but we must be wary about possible selection effects and acknowledge this possibility. 
Sampling studies constitute a higher level of statistical evidence than observational 
studies, as they avoid the possibility of selection effects. 

In Chapter 10, we will discuss experiments that constitute the highest level of sta- 
tistical evidence. Experiments are appropriate when we are investigating the possibility 
of cause-effect relationships existing amongst variables defined on populations. 

The second question we need to address concerns the choice of the sample size n. It 
seems natural that we would like to choose as large a sample as possible. On the other 
hand, there are always costs associated with sampling, and sometimes each sample 
value is very expensive to obtain. Furthermore, often the more data we collect, the 
more difficulty we have in making sure that the data are not corrupted by various errors 
that can arise in the collection process. So our answer, concerning how large n need be, 
is that we want it chosen large enough so that we obtain the accuracy necessary but no 
larger. Accordingly, the statistician must specify what accuracy is required, and then n 
is determined. 

We will see in the subsequent chapters that there are various methods for specifying 
the required accuracy in a problem and then determining an appropriate value for n. 
Determining n is a key component in the implementation of a sampling study and is 
often referred to as a sample-size calculation. 




If we define

$$f_X(x) = \frac{|\{\pi : X(\pi) = x\}|}{N},$$

namely, $f_X(x)$ is the proportion of population members satisfying $X(\pi) = x$, then we
see that $f_X$ plays the role of the probability function because

$$F_X(x) = \sum_{z \le x} f_X(z).$$

We refer to $f_X$ as the population relative frequency function. Now, $f_X(x)$ may be
estimated, based on the sample $\{\pi_1, \ldots, \pi_n\}$, by

$$\hat{f}_X(x) = \frac{|\{\pi_i : X(\pi_i) = x,\ i = 1, \ldots, n\}|}{n} = \frac{1}{n} \sum_{i=1}^{n} I_{\{x\}}(X(\pi_i)),$$

namely, the proportion of sample members $\pi$ satisfying $X(\pi) = x$.

With categorical variables, $\hat{f}_X(x)$ estimates the population proportion $f_X(x)$ in
the category specified by x. With some quantitative variables, however, $f_X$ is not an
appropriate quantity to estimate, and an alternative function must be considered.
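
For a categorical variable, the estimate $\hat{f}_X$ amounts to counting. A minimal sketch follows, in which the ten responses and the name `relative_frequency` are made up for illustration.

```python
from collections import Counter

def relative_frequency(sample):
    """Estimate f_X: the proportion of sample members taking each observed value."""
    n = len(sample)
    return {value: count / n for value, count in Counter(sample).items()}

# hypothetical categorical sample of size n = 10
responses = ["A", "B", "B", "C", "B", "A", "B", "C", "B", "A"]
f_hat = relative_frequency(responses)
print(f_hat)                           # proportions for categories A, B, and C
```

Values that never occur in the sample receive no entry, which corresponds to an estimate of 0.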


5.4.3 | Histograms 


Quantitative variables can be further classified as either discrete or continuous vari- 
ables. Continuous variables are those that we can measure to an arbitrary precision as 
we increase the accuracy of a measuring instrument. For example, the height of an 
individual could be considered a continuous variable, whereas the number of years of 
education an individual possesses would be considered a discrete quantitative variable. 
For discrete quantitative variables, $f_X$ is an appropriate quantity to describe a population distribution, but we proceed differently with continuous quantitative variables.

Suppose that X is a continuous quantitative variable. In this case, it makes more
sense to group values into intervals, given by

$$(h_1, h_2],\ (h_2, h_3],\ \ldots,\ (h_{m-1}, h_m],$$

where the $h_i$ are chosen to satisfy $h_1 < h_2 < \cdots < h_m$ with $(h_1, h_m)$ effectively
covering the range of possible values for X. Then we define

$$h_X(x) = \begin{cases} \dfrac{|\{\pi : X(\pi) \in (h_i, h_{i+1}]\}|}{N (h_{i+1} - h_i)} & x \in (h_i, h_{i+1}] \\[1ex] 0 & \text{otherwise} \end{cases}$$

and refer to $h_X$ as a density histogram function. Here, $h_X(x)$ is the proportion of
population elements $\pi$ that have their measurement $X(\pi)$ in the interval $(h_i, h_{i+1}]$
containing x, divided by the length of the interval.
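
The definition of $h_X$ translates directly into a short computation. The following sketch returns the height of $h_X$ on each interval $(h_i, h_{i+1}]$; the five measurements and the endpoints are hypothetical, and `density_histogram` is an illustrative name.

```python
from bisect import bisect_left

def density_histogram(values, endpoints):
    """Heights of h_X on the intervals (h_i, h_{i+1}] for sorted endpoints."""
    N = len(values)
    counts = [0] * (len(endpoints) - 1)
    for v in values:
        i = bisect_left(endpoints, v) - 1   # index of the right-closed interval containing v
        if 0 <= i < len(counts):            # values outside (h_1, h_m] are not counted
            counts[i] += 1
    return [c / (N * (endpoints[i + 1] - endpoints[i])) for i, c in enumerate(counts)]

values = [0.5, 1.0, 1.5, 1.7, 3.0]          # hypothetical measurements
endpoints = [0, 1, 2, 4]                    # intervals (0, 1], (1, 2], (2, 4]
heights = density_histogram(values, endpoints)
print(heights)
# total area under h_X; it equals 1 when every value falls in some interval
print(sum(h * (endpoints[i + 1] - endpoints[i]) for i, h in enumerate(heights)))
```

Dividing by the interval length is what makes the areas, rather than the heights, represent proportions.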

In Figure 5.4.1, we have plotted a density histogram based on a sample of 10,000 
from an N(0, 1) distribution (we are treating this sample as the full population) and 




using the values $h_1 = -5, h_2 = -4, \ldots, h_{11} = 5$. Note that the vertical lines are only
artifacts of the plotting software and do not represent values of $h_X$, as these are given
by the horizontal lines.


Figure 5.4.1: Density histogram function for a sample of 10,000 from an N(0, 1) distribution
using the values $h_1 = -5, h_2 = -4, \ldots, h_{11} = 5$.


If $x \in (h_i, h_{i+1}]$, then $h_X(x)(h_{i+1} - h_i)$ gives the proportion of individuals in the
population that have their measurement $X(\pi)$ in $(h_i, h_{i+1}]$. Furthermore, we have

$$F_X(h_j) = \int_{-\infty}^{h_j} h_X(x)\,dx$$

for each interval endpoint and

$$F_X(h_j) - F_X(h_i) = \int_{h_i}^{h_j} h_X(x)\,dx$$

when $h_i \le h_j$. If the intervals $(h_i, h_{i+1}]$ are small, then we expect that

$$F_X(b) - F_X(a) \approx \int_a^b h_X(x)\,dx$$

for any choice of a < b.

Now suppose that the lengths $h_{i+1} - h_i$ are small and N is very large. Then it
makes sense to imagine a smooth, continuous function $f_X$, e.g., perhaps a normal or
gamma density function, that approximates $h_X$ in the sense that

$$\int_a^b f_X(x)\,dx \approx \int_a^b h_X(x)\,dx$$

for every a < b. Then we will also have

$$\int_a^b f_X(x)\,dx \approx F_X(b) - F_X(a)$$



for every a < b. We will refer to such an $f_X$ as a density function for the population
distribution.

In essence, this is how many continuous distributions arise in practice. In Figure
5.4.2, we have plotted a density histogram for the same values used in Figure 5.4.1, but
this time we used the interval endpoints $h_1 = -5, h_2 = -4.75, \ldots, h_{41} = 5$. We note
that Figure 5.4.2 looks much more like a continuous function than does Figure 5.4.1.


Figure 5.4.2: Density histogram function for a sample of 10,000 from an N(0, 1) distribution
using the values $h_1 = -5, h_2 = -4.75, \ldots, h_{41} = 5$.


5.4.4 | Survey Sampling 


Finite population sampling provides the formulation for a very important application 
of statistics, namely, survey sampling or polling. Typically, a survey consists of a set of 
questions that are asked of a sample $\{\pi_1, \ldots, \pi_n\}$ from a population $\Pi$. Each question
corresponds to a measurement, so if there are m questions, the response from a respondent $\pi$ is the m-dimensional vector $(X_1(\pi), X_2(\pi), \ldots, X_m(\pi))$. A very important
example of survey sampling is the pre-election polling that is undertaken to predict the 
outcome of a vote. Also, many consumer product companies engage in extensive mar- 
ket surveys to try to learn what consumers want and so gain information that can lead 
to improved sales. 

Typically, the analysis of the results will be concerned not only with the population
distributions of the individual $X_i$ over the population but also the joint population distributions. For example, the joint cumulative distribution function of $(X_1, X_2)$ is given
by

$$F_{(X_1, X_2)}(x_1, x_2) = \frac{|\{\pi : X_1(\pi) \le x_1,\ X_2(\pi) \le x_2\}|}{N},$$

namely, $F_{(X_1, X_2)}(x_1, x_2)$ is the proportion of the individuals in the population whose
$X_1$ measurement is no greater than $x_1$ and whose $X_2$ measurement is no greater than
$x_2$. Of course, we can also define the joint distributions of three or more measurements.




These joint distributions are what we use to answer questions like, is there a relationship
between $X_1$ and $X_2$, and if so, what form does it take? This topic will be extensively
discussed in Chapter 10. We can also define $f_{(X_1, X_2)}$ for the joint distribution, and joint
density histograms are again useful when $X_1$ and $X_2$ are both continuous quantitative
variables.
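
The joint cumulative distribution function is again a proportion, now over pairs of measurements. A small sketch, assuming a hypothetical population of five individuals recorded as $(X_1, X_2)$ pairs; the name `joint_cdf` and the data are illustrative.

```python
def joint_cdf(pairs, x1, x2):
    """Proportion of individuals with X1 <= x1 and X2 <= x2."""
    return sum(a <= x1 and b <= x2 for a, b in pairs) / len(pairs)

# hypothetical population: (X1, X2) recorded for each of five individuals
population = [(1, 20), (0, 35), (1, 50), (1, 65), (0, 19)]
print(joint_cdf(population, 0, 40))    # proportion with X1 <= 0 and X2 <= 40
print(joint_cdf(population, 1, 40))    # proportion with X1 <= 1 and X2 <= 40
```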


EXAMPLE 5.4.3 

Suppose there are four candidates running for mayor in a particular city. A random 
sample of 1000 voters is selected; they are each asked if they will vote and, if so, 
which of the four candidates they will vote for. Additionally, the respondents are asked 
their age. We denote the answer to the question of whether or not they will vote by $X_1$,
with $X_1(\pi) = 1$ meaning yes and $X_1(\pi) = 0$ meaning no. For those voting, we denote
by $X_2$ the response concerning which candidate they will vote for, with $X_2(\pi) = i$
indicating candidate i. Finally, the age in years of the respondent is denoted by $X_3$. In
addition to the distributions of $X_1$ and $X_2$, the pollster is also interested in the joint
distributions of $(X_1, X_3)$ and $(X_2, X_3)$, as these tell us about the relationship between
voter participation and age in the first case and candidate choice and age in the second
case. ∎


There are many interesting and important aspects to survey sampling that go well 
beyond this book. For example, it is often the case with human populations that a ran- 
domly selected person will not respond to a survey. This is called nonresponse error, 
and it is a serious selection effect. The sampler must design the study carefully to try 
to mitigate the effects of nonresponse error. Furthermore, there are variants of simple 
random sampling (see Challenge 5.4.20) that can be preferable in certain contexts, as 
these increase the accuracy of the results. The design of the actual questionnaire used 
is also very important, as we must ensure that responses address the issues intended 
without biasing the results. 


Summary of Section 5.4 


• Simple random sampling from a population $\Pi$ means that we randomly select
a subset of size n from $\Pi$ in such a way that each subset of n has the same
probability, namely $1/\binom{|\Pi|}{n}$, of being selected.


• Data that arise from a sampling study are generated from the distribution of the
measurement of interest X over the whole population $\Pi$ rather than some subpopulation. This is why sampling studies are preferred to observational studies.


• When the sample size n is small relative to $|\Pi|$, we can treat the observed values
of X as a sample from the distribution of X over the population.


EXERCISES 


5.4.1 Suppose we have a population $\Pi = \{\pi_1, \ldots, \pi_{10}\}$ and quantitative measurement
X given by:




Calculate $F_X$, $f_X$, $\mu_X$, and $\sigma_X^2$.

5.4.2 Suppose you take a sample of n = 3 (without replacement) from the population 
in Exercise 5.4.1. 

(a) Can you consider this as an approximate i.i.d. sample from the population distribu- 
tion? Why or why not? 

(b) Explain how you would actually physically carry out the sampling from the popu- 
lation in this case. (Hint: Table D.1.) 


(c) Using the method you outlined in part (b), generate three samples of size n = 3 and
calculate $\bar{X}$ for each sample.


5.4.3 Suppose you take a sample of n = 4 (with replacement) from the population in 
Exercise 5.4.1. 

(a) Can you consider this as an approximate i.i.d. sample from the population distribu- 
tion? Why or why not? 

(b) Explain how you would actually physically carry out the sampling in this case. 


(c) Using the method you outlined in part (b), generate three samples of size n = 3 and
calculate $\bar{X}$ for each sample.

5.4.4 Suppose we have a finite population $\Pi$ and a measurement $X : \Pi \to \{0, 1\}$
where $|\Pi| = N$ and $|\{\pi : X(\pi) = 0\}| = a$.

(a) Determine $f_X(0)$ and $f_X(1)$. Can you identify this population distribution?

(b) For a simple random sample of size n, determine the probability that $n \hat{f}_X(0) = x$.
(c) Under the assumption of i.i.d. sampling, determine the probability that $n \hat{f}_X(0) = x$.
5.4.5 Suppose the following sample of size of n = 20 is obtained from an industrial 
process. 


3.9  7.2  6.9  4.5  5.8  3.7  4.4  4.5  5.6  2.5


4.8  8.5  4.3  1.2  2.3  3.1  3.4  4.8  1.8  3.7


(a) Construct a density histogram for this data set using the intervals (1, 4.5], (4.5, 5.5],
(5.5, 6.5], (6.5, 10].

(b) Construct a density histogram for this data set using the intervals (1, 3.5], (3.5, 4.5], 
(4.5, 6.5], (6.5, 10]. 

(c) Based on the results of parts (a) and (b), what do you conclude about histograms? 
5.4.6 Suppose it is known that in a population of 1000 students, 350 students will vote 
for party A, 550 students will vote for party B, and the remaining students will vote 
for party C. 

(a) Explain how such information can be obtained. 


(b) If we let $X : \Pi \to \{A, B, C\}$ be such that $X(\pi)$ is the party that $\pi$ will vote for,
then explain why we cannot represent the population distribution of X by $F_X$.


(c) Compute $f_X$.
(d) Explain how one might go about estimating $f_X$ prior to the election.


(e) What is unrealistic about the population distribution specified via $f_X$? (Hint: Does
it seem realistic, based on what you know about voting behavior?)




5.4.7 Consider the population $\Pi$ to be files stored on a computer at a particular time.
Suppose that $X(\pi)$ is the type of file as indicated by its extension, e.g., .mp3. Is X a
categorical or quantitative variable? 

5.4.8 Suppose that you are asked to estimate the proportion of students in a college of
15,000 students who intend to work during the summer.

(a) Identify the population $\Pi$, the variable X, and $f_X$. What kind of variable is X?

(b) How could you determine $f_X$ exactly?

(c) Why might you not be able to determine $f_X$ exactly? Propose a procedure for
estimating $f_X$ in such a situation.

(d) Suppose you were also asked to estimate the proportion of students who intended 
to work but could not find a job. Repeat parts (a), (b), and (c) for this situation. 

5.4.9 Sometimes participants in a poll do not respond truthfully to a question. For 
example, students who are asked “Have you ever illegally downloaded music?” may 
not respond truthfully even if they are assured that their responses are confidential. 
Suppose a simple random sample of students was chosen from a college and students 
were asked this question. 

(a) If students were asked this question by a person, comment on how you think the 
results of the sampling study would be affected. 

(b) If students were allowed to respond anonymously, perhaps by mailing in a ques- 
tionnaire, comment on how you think the results would be affected. 

(c) One technique for dealing with the respondent bias induced by such questions is 
to have students respond truthfully only when a certain random event occurs. For 
example, we might ask a student to toss a fair coin three times and lie whenever they 
obtain two heads. What is the probability that a student tells the truth? Once you have 
completed the study and have recorded the proportion of students who said they did 
cheat, what proportion would you record as your estimate of the proportion of students 
who actually did cheat? 

5.4.10 A market research company is asked to determine how satisfied owners are with
their purchase of a new car in the last 6 months. Satisfaction is to be measured by respondents choosing a point on a seven-point scale {1, 2, 3, 4, 5, 6, 7}, where 1 denotes
completely dissatisfied and 7 denotes completely satisfied (such a scale is commonly
called a Likert scale).

(a) Identify $\Pi$, the variable X, and $f_X$.


(b) It is common to treat a variable such as X as a quantitative variable. Do you think 
this is correct? Would it be correct to treat X as a categorical variable? 


(c) A common criticism of using such a scale is that the interpretation of a statement 
such as 3 = “I’m somewhat dissatisfied” varies from one person to another. Comment 
on how this affects the validity of the study. 


COMPUTER EXERCISES 


5.4.11 Generate a sample of 1000 from an N(3, 2) distribution. 
(a) Calculate $\hat{F}_X$ for this sample.




(b) Plot a density histogram based on these data using the intervals of length 1 over the
range (-5, 10).

(c) Plot a density histogram based on these data using the intervals of length 0.1 over
the range (-5, 10).

(d) Comment on the difference in the look of the histograms in parts (b) and (c). To 
what do you attribute this? 

(e) What limits the size of the intervals we use to group observations when we are 
plotting histograms? 

5.4.12 Suppose we have a population of 10,000 elements, each with a unique label
from the set {1, 2, 3, ..., 10,000}.

(a) Generate a sample of 500 labels from this population using simple random sam- 
pling. 

(b) Generate a sample of 500 labels from this population using i.i.d. sampling. 


PROBLEMS 


5.4.13 Suppose we have a finite population $\Pi$ and a measurement $X : \Pi \to \{0, 1, 2\}$,
where $|\Pi| = N$ and $|\{\pi : X(\pi) = 0\}| = a$ and $|\{\pi : X(\pi) = 1\}| = b$. This problem
generalizes Exercise 5.4.4.

(a) Determine $f_X(0)$, $f_X(1)$, and $f_X(2)$.

(b) For a simple random sample of size n, determine the probability that $\hat{f}_X(0) = f_0$,
$\hat{f}_X(1) = f_1$, and $\hat{f}_X(2) = f_2$.

(c) Under the assumption of i.i.d. sampling, determine the probability that $\hat{f}_X(0) = f_0$,
$\hat{f}_X(1) = f_1$, and $\hat{f}_X(2) = f_2$.

5.4.14 Suppose X is a quantitative measurement defined on a finite population. 

(a) Prove that the population mean equals $\mu_X = \sum_x x f_X(x)$, i.e., the average of $X(\pi)$
over all population elements $\pi$ equals $\mu_X$.

(b) Prove that the population variance is given by $\sigma_X^2 = \sum_x (x - \mu_X)^2 f_X(x)$, i.e., the
average of $(X(\pi) - \mu_X)^2$ over all population elements $\pi$ equals $\sigma_X^2$.

5.4.15 Suppose we have the situation described in Exercise 5.4.4, and we take a simple
random sample of size n from $\Pi$ where $|\Pi| = N$.

(a) Prove that the mean of $\hat{f}_X(0)$ is given by $f_X(0)$. (Hint: Note that we can write
$\hat{f}_X(0) = n^{-1} \sum_{i=1}^{n} I_{\{0\}}(X(\pi_i))$ and $I_{\{0\}}(X(\pi_i)) \sim$ Bernoulli$(f_X(0))$.)

(b) Prove that the variance of $\hat{f}_X(0)$ is given by

$$\frac{f_X(0)(1 - f_X(0))}{n} \cdot \frac{N - n}{N - 1}. \tag{5.4.1}$$

(Hint: Use the hint in part (a), but note that the $I_{\{0\}}(X(\pi_i))$ are not independent. Use
Theorem 3.3.4(b) and evaluate $\mathrm{Cov}(I_{\{0\}}(X(\pi_i)), I_{\{0\}}(X(\pi_j)))$ in terms of $f_X(0)$.)
(c) Repeat the calculations in parts (a) and (b), but this time assume that you take a
sample of n with replacement. (Hint: Use Exercise 5.4.4(c).)
(d) Explain why the factor (N − n)/(N − 1) in (5.4.1) is called the finite population
correction factor.




5.4.16 Suppose we have a finite population $\Pi$ and we do not know $|\Pi| = N$. In
addition, suppose we have a measurement variable $X : \Pi \to \{0, 1\}$ and we know
that $N f_X(0) = a$ where a is known. Based on a simple random sample of n from $\Pi$,
determine an estimator of N. (Hint: Use a function of $\hat{f}_X(0)$.)

5.4.17 Suppose that X is a quantitative variable defined on a population $\Pi$ and that we
take a simple random sample of size n from $\Pi$.

(a) If we estimate the population mean $\mu_X$ by the sample mean $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X(\pi_i)$,
prove that $E(\bar{X}) = \mu_X$ where $\mu_X$ is defined in Problem 5.4.14(a). (Hint: What is the
distribution of each $X(\pi_i)$?)

(b) Under the assumption that i.i.d. sampling makes sense, show that the variance of $\bar{X}$
equals $\sigma_X^2 / n$, where $\sigma_X^2$ is defined in Problem 5.4.14(b).

5.4.18 Suppose we have a finite population $\Pi$ and we do not know $|\Pi| = N$. In
addition, suppose we have a measurement variable $X : \Pi \to R^1$ and we know $T =
\sum_{\pi} X(\pi)$. Based on a simple random sample of n from $\Pi$, determine an estimator of
N. (Hint: Use a function of $\bar{X}$.)


5.4.19 Under i.i.d. sampling, prove that $\hat{f}_X(x) \to f_X(x)$ as $n \to \infty$. (Hint: $\hat{f}_X(x) =
n^{-1} \sum_{i=1}^{n} I_{\{x\}}(X(\pi_i))$.)


CHALLENGES 


5.4.20 (Stratified sampling) Suppose that X is a quantitative variable defined on a population $\Pi$ and that we can partition $\Pi$ into two subpopulations $\Pi_1$ and $\Pi_2$, such that a
proportion p of the full population is in $\Pi_1$. Let $f_{iX}$ denote the conditional population
distribution of X on $\Pi_i$.


(a) Prove that $f_X(x) = p f_{1X}(x) + (1 - p) f_{2X}(x)$.

(b) Establish that $\mu_X = p \mu_{1X} + (1 - p) \mu_{2X}$, where $\mu_{iX}$ is the mean of X on $\Pi_i$.

(c) Establish that $\sigma_X^2 = p \sigma_{1X}^2 + (1 - p) \sigma_{2X}^2 + p(1 - p)(\mu_{1X} - \mu_{2X})^2$.

(d) Suppose that it makes sense to assume i.i.d. sampling whenever we take a sample
from either the full population or either of the subpopulations, i.e., whenever the sample sizes we are considering are small relative to the sizes of these populations. We
implement stratified sampling by taking a simple random sample of size $n_i$ from subpopulation $\Pi_i$. We then estimate $\mu_X$ by $p \bar{X}_1 + (1 - p) \bar{X}_2$, where $\bar{X}_i$ is the sample
mean based on the sample from $\Pi_i$. Prove that $E(p \bar{X}_1 + (1 - p) \bar{X}_2) = \mu_X$ and

$$\mathrm{Var}(p \bar{X}_1 + (1 - p) \bar{X}_2) = p^2 \frac{\sigma_{1X}^2}{n_1} + (1 - p)^2 \frac{\sigma_{2X}^2}{n_2}.$$

(e) Under the assumptions of part (d), prove that

$$\mathrm{Var}(p \bar{X}_1 + (1 - p) \bar{X}_2) \le \mathrm{Var}(\bar{X})$$

when $\bar{X}$ is based on a simple random sample of size n from the full population and
$n_1 = pn$, $n_2 = (1 - p)n$. This is called proportional stratified sampling.

(f) Under what conditions is there no benefit to proportional stratified sampling? What 
do you conclude about situations in which stratified sampling will be most beneficial? 


282 Section 5.5: Some Basic Inferences 


DISCUSSION TOPICS 


5.4.21 Sometimes it is argued that it is possible for a skilled practitioner to pick a more 
accurate representative sample of a population deterministically rather than by employ- 
ing simple random sampling. This argument is based in part on the argument that it is 
always possible with simple random sampling that we could get a very unrepresenta- 
tive sample through pure chance and that this can be avoided by an expert. Comment 
on this assertion. 

5.4.22 Suppose it is claimed that a quantitative measurement X defined on a finite
population $\Pi$ is approximately distributed according to a normal distribution with unknown mean and unknown variance. Explain fully what this claim means.


5.5 | Some Basic Inferences 


Now suppose we are in a situation involving a measurement X, whose distribution is
unknown, and we have obtained the data $(x_1, x_2, \ldots, x_n)$, i.e., observed n values of X.
Hopefully, these data were the result of simple random sampling, but perhaps they were
collected as part of an observational study. Denote the associated unknown population
relative frequency function, or an approximating density, by $f_X$ and the population
distribution function by $F_X$.

What we do now with the data depends on two things. First, we have to determine 
what we want to know about the underlying population distribution. Typically, our 
interest is in only a few characteristics of this distribution — the mean and variance. 
Second, we have to use statistical theory to combine the data with the statistical model 
to make inferences about the characteristics of interest. 

We now discuss some typical characteristics of interest and present some informal 
estimation methods for these characteristics, known as descriptive statistics. These 
are often used as a preliminary step before more formal inferences are drawn and are 
justified on simple intuitive grounds. They are called descriptive because they are 
estimating quantities that describe features of the underlying distribution. 


5.5.1 | Descriptive Statistics 


Statisticians often focus on various characteristics of distributions. We present some of 
these in the following examples. 


EXAMPLE 5.5.1 Estimating Proportions and Cumulative Proportions 

Often we want to make inferences about the value $f_X(x)$ or the value $F_X(x)$ for a
specific x. Recall that $f_X(x)$ is the proportion of population members whose X measurement equals x. In general, $F_X(x)$ is the proportion of population members whose
X measurement is less than or equal to x.

Now suppose we have a sample $(x_1, x_2, \ldots, x_n)$ from $f_X$. A natural estimate of
$f_X(x)$ is given by $\hat{f}_X(x)$, the proportion of sample values equal to x. A natural estimate
of $F_X(x)$ is given by $\hat{F}_X(x) = n^{-1} \sum_{i=1}^{n} I_{(-\infty, x]}(x_i)$, the proportion of sample values
less than or equal to x, otherwise known as the empirical distribution function evaluated
at x.




Suppose we obtained the following sample of n = 10 data values.


1.2  2.1  0.4  3.3  -2.1  4.0  -0.3  2.2  1.5  5.0


In this case, $\hat{f}_X(x) = 0.1$ whenever x is a data value and is 0 otherwise. To compute
$\hat{F}_X(x)$, we simply count how many sample values are less than or equal to x and
divide by n = 10. For example, $\hat{F}_X(-3) = 0/10 = 0$, $\hat{F}_X(0) = 2/10 = 0.2$, and
$\hat{F}_X(4) = 9/10 = 0.9$. ∎
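
These counts are easy to check by machine. The sketch below uses the ten data values of this example; the helper names `f_hat` and `F_hat` are illustrative.

```python
data = [1.2, 2.1, 0.4, 3.3, -2.1, 4.0, -0.3, 2.2, 1.5, 5.0]

def f_hat(x):
    """Proportion of sample values equal to x."""
    return data.count(x) / len(data)

def F_hat(x):
    """Proportion of sample values less than or equal to x."""
    return sum(v <= x for v in data) / len(data)

print(f_hat(1.5), f_hat(1.0))          # a data value, and a value not in the sample
print(F_hat(-3), F_hat(0), F_hat(4))   # the three evaluations worked out in the example
```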


An important class of characteristics of the distribution of a quantitative variable X 
is given by the following definition. 


Definition 5.5.1 For $p \in [0, 1]$, the pth quantile (or 100pth percentile) $x_p$, for
the distribution with cdf $F_X$, is defined to be the smallest number $x_p$ satisfying

$$p \le F_X(x_p).$$


For example, if your mark on a test placed you at the 90th percentile, then your mark
equals $x_{0.9}$ and 90% of your fellow test takers achieved your mark or lower. Note that
by the definition of the inverse cumulative distribution function (Definition 2.10.1), we
can write $x_p = F_X^{-1}(p) = \min\{x : p \le F_X(x)\}$.
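
For a distribution that places mass 1/n on each of n listed values, the minimum in this definition can be found by scanning the sorted values. The following is a sketch under that assumption, restricted to 0 < p ≤ 1; the name `quantile` is illustrative, and the data are the ten values of Example 5.5.1.

```python
def quantile(values, p):
    """Smallest x with p <= F(x), where F puts mass 1/n on each listed value (0 < p <= 1)."""
    xs = sorted(values)
    n = len(xs)
    for i, x in enumerate(xs, start=1):
        if p <= i / n:                 # at least a proportion i/n of the values are <= x
            return x

data = [1.2, 2.1, 0.4, 3.3, -2.1, 4.0, -0.3, 2.2, 1.5, 5.0]
print(quantile(data, 0.25), quantile(data, 0.5), quantile(data, 0.9))
```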

When $F_X$ is strictly increasing and continuous, then $F_X^{-1}(p)$ is the unique value $x_p$
satisfying

$$F_X(x_p) = p. \tag{5.5.1}$$

Figure 5.5.1 illustrates the situation in which there is a unique solution to (5.5.1). When
$F_X$ is not strictly increasing or continuous (as when X is discrete), then there may be
more than one, or no, solutions to (5.5.1). Figure 5.5.2 illustrates the situation in which
there is no solution to (5.5.1).


Figure 5.5.1: The pth quantile $x_p$ when there is a unique solution to (5.5.1).




Figure 5.5.2: The pth quantile $x_p$ determined by a cdf $F_X$ when there is no solution to
(5.5.1).


So, when X is a continuous measurement, a proportion p of the population have
their X measurement less than or equal to $x_p$. As particular cases, $x_{0.5} = F_X^{-1}(0.5)$
is the median, while $x_{0.25} = F_X^{-1}(0.25)$ and $x_{0.75} = F_X^{-1}(0.75)$ are the first and third
quartiles, respectively, of the distribution.


EXAMPLE 5.5.2 Estimating Quantiles 

A natural estimate of a population quantile $x_p = F_X^{-1}(p)$ is to use $\hat{x}_p = \hat{F}_X^{-1}(p)$. Note,
however, that $\hat{F}_X$ is not continuous, so there may not be a solution to (5.5.1) using $\hat{F}_X$.
Applying Definition 5.5.1, however, leads to the following estimate. First, order the
observed sample values $(x_1, \ldots, x_n)$ to obtain the order statistics $x_{(1)} < \cdots < x_{(n)}$ (see
Section 2.8.4). Then, note that $x_{(i)}$ is the (i/n)-th quantile of the empirical distribution,
because

$$\hat{F}_X(x_{(i)}) = \frac{i}{n}$$

and $\hat{F}_X(x) < i/n$ whenever $x < x_{(i)}$. In general, we have that the sample pth quantile
is $\hat{x}_p = x_{(i)}$ whenever

$$\frac{i - 1}{n} < p \le \frac{i}{n}. \tag{5.5.2}$$

A number of modifications to this estimate are sometimes used. For example, if we
find i such that (5.5.2) is satisfied and put

$$\tilde{x}_p = x_{(i-1)} + n \left( x_{(i)} - x_{(i-1)} \right) \left( p - \frac{i - 1}{n} \right), \tag{5.5.3}$$

then $\tilde{x}_p$ is the linear interpolation between $x_{(i-1)}$ and $x_{(i)}$. When n is even, this definition gives the sample median as $\tilde{x}_{0.5} = x_{(n/2)}$; a similar formula holds when n is odd
(Problem 5.5.21). Also see Problem 5.5.22 for more discussion of (5.5.3).

Quite often the sample median is defined to be

$$\check{x}_{0.5} = \begin{cases} x_{((n+1)/2)} & n \text{ odd} \\[1ex] \frac{1}{2} \left( x_{(n/2)} + x_{(n/2+1)} \right) & n \text{ even}, \end{cases} \tag{5.5.4}$$




namely, the middle value when n is odd and the average of the two middle values when 
n is even. For n large enough, all these definitions will yield similar answers. The use 
of any of these is permissible in an application. 

Consider the data in Example 5.5.1. Sorting the data from smallest to largest, the 
order statistics are given by the following table. 


x(1) = -2.1   x(2) = -0.3   x(3) = 0.4   x(4) = 1.2   x(5) = 1.5


x(6) = 2.1    x(7) = 2.2    x(8) = 3.3   x(9) = 4.0   x(10) = 5.0


Then, using (5.5.3), the sample median is given by $\tilde{x}_{0.5} = x_{(5)} = 1.5$, while the sample
quartiles are given by

$$\tilde{x}_{0.25} = x_{(2)} + 10 \left( x_{(3)} - x_{(2)} \right)(0.25 - 0.2) = -0.3 + 10 (0.4 - (-0.3))(0.25 - 0.2) = 0.05$$

and

$$\tilde{x}_{0.75} = x_{(7)} + 10 \left( x_{(8)} - x_{(7)} \right)(0.75 - 0.7) = 2.2 + 10 (3.3 - 2.2)(0.75 - 0.7) = 2.75.$$

So in this case, we estimate that 25% of the population under study has an X measurement less than 0.05, etc. ∎
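
The interpolation rule (5.5.3) can be sketched in a few lines, using (5.5.2) to locate i. The function name `interpolated_quantile` is illustrative, and the handling of i = 1, where $x_{(0)}$ is undefined, is an assumption made here rather than something the text specifies.

```python
def interpolated_quantile(sample, p):
    """Sample p-th quantile by the linear interpolation (5.5.3), for 0 < p <= 1."""
    xs = sorted(sample)
    n = len(xs)
    # find i with (i - 1)/n < p <= i/n, as in (5.5.2)
    i = next(k for k in range(1, n + 1) if p <= k / n)
    if i == 1:
        return xs[0]                   # assumption: x_(0) is undefined, so fall back to x_(1)
    return xs[i - 2] + n * (xs[i - 1] - xs[i - 2]) * (p - (i - 1) / n)

data = [1.2, 2.1, 0.4, 3.3, -2.1, 4.0, -0.3, 2.2, 1.5, 5.0]
for p in (0.25, 0.5, 0.75):
    print(p, round(interpolated_quantile(data, p), 2))
```

Up to floating-point rounding, the three values agree with the quartiles and median computed by hand in this example.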


EXAMPLE 5.5.3 Measuring Location and Scale of a Population Distribution 
Often we are asked to make inferences about the value of the population mean

$$\mu_X = \frac{1}{|\Pi|} \sum_{\pi \in \Pi} X(\pi)$$

and the population variance

$$\sigma_X^2 = \frac{1}{|\Pi|} \sum_{\pi \in \Pi} (X(\pi) - \mu_X)^2,$$

where $\Pi$ is a finite population and X is a real-valued measurement defined on it. These
are measures of the location and spread of the population distribution about the mean,
respectively. Note that calculating a mean or variance makes sense only when X is a
quantitative variable.

When X is discrete, we can also write

$$\mu_X = \sum_x x f_X(x)$$

because $|\Pi| f_X(x)$ equals the number of elements $\pi \in \Pi$ with $X(\pi) = x$. In the
continuous case, using an approximating density $f_X$, we can write

$$\mu_X \approx \int_{-\infty}^{\infty} x f_X(x)\,dx.$$

Similar formulas exist for the population variance of X (see Problem 5.4.14).




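It will probably occur to you that a natural estimate of the population mean $\mu_X$ is
given by the sample mean

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

Similarly, a natural estimate of the population variance $\sigma_X^2$ is given by the sample variance

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \tag{5.5.5}$$

Later we will explain why we divided by n − 1 in (5.5.5) rather than n. Actually, it
makes little difference which we use, for even modest values of n. The sample standard
deviation is given by s, the positive square root of $s^2$. For the data in Example 5.5.1,
we obtain $\bar{x} = 1.73$ and s = 2.097.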
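
These two numbers can be reproduced with Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n − 1 and whose `pvariance` divides by n. This is only a sketch using the data of Example 5.5.1.

```python
import statistics

data = [1.2, 2.1, 0.4, 3.3, -2.1, 4.0, -0.3, 2.2, 1.5, 5.0]

x_bar = statistics.mean(data)          # sample mean
s2 = statistics.variance(data)         # sample variance, divides by n - 1 as in (5.5.5)
s = statistics.stdev(data)             # sample standard deviation, the square root of s2
print(round(x_bar, 2), round(s, 3))

# dividing by n instead of n - 1 gives a slightly smaller value
print(round(statistics.pvariance(data), 3), round(s2, 3))
```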

The population mean μ_X and population standard deviation σ_X serve as a pair, in
which μ_X measures where the distribution is located on the real line and σ_X measures
how much spread there is in the distribution about μ_X. Clearly, the greater the value of
σ_X, the more variability there is in the distribution.

Alternatively, we could use the population median x_{0.5} as a measure of location
of the distribution and the population interquartile range x_{0.75} − x_{0.25} as a measure
of the amount of variability in the distribution around the median. The median and 
interquartile range are the preferred choice to measure these aspects of the distribution 
whenever the distribution is skewed, i.e., not symmetrical. This is because the median 
is insensitive to very extreme values, while the mean is not. For example, house prices 
in an area are well known to exhibit a right-skewed distribution. A few houses selling 
for very high prices will not change the median price but could result in a big change 
in the mean price. 

When we have a symmetric distribution, the mean and median will agree (provided 
the mean exists). The greater the skewness in a distribution, however, the greater will 
be the discrepancy between its mean and median. For example, in Figure 5.5.3 we have 
plotted the density of a χ²(4) distribution. This distribution is skewed to the right, and
the mean is 4 while the median is 3.3567. 




Figure 5.5.3: The density f of a χ²(4) distribution.




We estimate the population interquartile range by the sample interquartile range
(IQR) given by IQR = \tilde{x}_{0.75} − \tilde{x}_{0.25}. For the data in Example 5.5.1, we obtain the
sample median to be \tilde{x}_{0.5} = 1.5, while IQR = 2.75 − 0.05 = 2.70.

If we change the largest value in the sample from x_{(10)} = 5.0 to x_{(10)} = 500.0, the
sample median remains \tilde{x}_{0.5} = 1.5, but note that the sample mean goes from 1.73 to
51.23! ■
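
The insensitivity of the median and IQR to one extreme value can be seen directly. The sketch below assumes the interpolation form of (5.5.3) suggested by Example 5.5.2; the helper names sample_quantile and summarize are hypothetical.

```python
import math

def sample_quantile(data, p):
    """Interpolated sample quantile (assumed form of (5.5.3)), for 1/n <= p <= 1."""
    xs = sorted(data)
    n = len(xs)
    i = max(1, math.ceil(n * p - 1e-12))   # smallest i with p <= i/n
    if i == 1:
        return xs[0]
    return xs[i - 2] + n * (xs[i - 1] - xs[i - 2]) * (p - (i - 1) / n)

def summarize(data):
    """Return (mean, median, IQR) for a sample."""
    mean = sum(data) / len(data)
    median = sample_quantile(data, 0.5)
    iqr = sample_quantile(data, 0.75) - sample_quantile(data, 0.25)
    return mean, median, iqr

original = [-2.1, -0.3, 0.4, 1.2, 1.5, 2.1, 2.2, 3.3, 4.0, 5.0]
contaminated = original[:-1] + [500.0]     # replace the largest value by 500.0

mean0, median0, iqr0 = summarize(original)
mean1, median1, iqr1 = summarize(contaminated)
print(round(mean0, 2), round(median0, 2), round(iqr0, 2))
print(round(mean1, 2), round(median1, 2), round(iqr1, 2))
```

Only the mean changes, because the median and quartiles here depend only on order statistics that the contamination leaves untouched.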


5.5.2 | Plotting Data 


It is always a good idea to plot the data. For discrete quantitative variables, we can plot 
\hat{f}_X, i.e., plot the sample proportions (relative frequencies). For continuous quantitative
variables, we introduced the density histogram in section 5.4.3. These plots give us 
some idea of the shape of the distribution from which we are sampling. For example, 
we can see if there is any evidence that the distribution is strongly skewed. 

We now consider another very useful plot for quantitative variables. 


EXAMPLE 5.5.4 Boxplots and Outliers 

Another useful plot for quantitative variables is known as a boxplot. For example, 
Figure 5.5.4 gives a boxplot for the data in Example 5.5.1. The line in the center of the 
box is the median. The line below the median is the first quartile, and the line above 
the median is the third quartile. 

The vertical lines from the quartiles are called whiskers, which run from the quartiles
to the adjacent values. The adjacent values are given by the greatest value less
than or equal to the upper limit (the third quartile plus 1.5 times the IQR) and by the
least value greater than or equal to the lower limit (the first quartile minus 1.5 times
the IQR). Values beyond the adjacent values, when these exist, are plotted with a *;
in this case, there are none. If we changed x_{(10)} = 5.0 to x_{(10)} = 15.0, however, we
see this extreme value plotted as a *, as shown in Figure 5.5.5.


Figure 5.5.4: A boxplot of the data in Example 5.5.1. 




Figure 5.5.5: A boxplot of the data in Example 5.5.1, changing x_{(10)} = 5.0 to
x_{(10)} = 15.0.


Points outside the upper and lower limits, and thus plotted by *, are commonly 
referred to as outliers. An outlier is a value that is extreme with respect to the rest 
of the observations. Sometimes outliers occur because a mistake has been made in 
collecting or recording the data, but they also occur simply because we are sampling 
from a long-tailed distribution. It is often difficult to ascertain which is the case in 
a particular application, but each such observation should be noted. We have seen in 
Example 5.5.3 that outliers can have a big impact on statistical analyses. Their effects 
should be recorded when reporting the results of a statistical analysis. ■
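
The rule for limits, adjacent values, and outliers described in this example can be sketched as follows. The quartiles 0.05 and 2.75 are taken from Example 5.5.2 (they are unchanged when the largest value is altered); the function name boxplot_summary is hypothetical.

```python
def boxplot_summary(data, q1, q3):
    """Limits, adjacent values, and outliers under the 1.5 * IQR rule."""
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = sorted(x for x in data if x < lower_limit or x > upper_limit)
    return {
        "limits": (lower_limit, upper_limit),
        "adjacent": (min(inside), max(inside)),   # ends of the whiskers
        "outliers": outliers,
    }

original = [-2.1, -0.3, 0.4, 1.2, 1.5, 2.1, 2.2, 3.3, 4.0, 5.0]
modified = original[:-1] + [15.0]

summary_original = boxplot_summary(original, q1=0.05, q3=2.75)
summary_modified = boxplot_summary(modified, q1=0.05, q3=2.75)
print(summary_original)
print(summary_modified)
```

With the original data the whiskers reach the smallest and largest observations; with the modified data the upper whisker stops at 4.0 and 15.0 is flagged.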


For categorical variables, it is typical to plot the data in a bar chart, as described in 
the next example. 


EXAMPLE 5.5.5 Bar Charts 
For categorical variables, we code the values of the variable as equispaced numbers 
and then plot constant-width rectangles (the bars) over these values so that the height 
of the rectangle over a value equals the proportion of times that value is assumed. Such 
a plot is called a bar chart. Note that the values along the x-axis are only labels and 
not to be treated as numbers that we can do arithmetic on, etc. 

For example, suppose we take a simple random sample of 100 students and record 
their favorite flavor of ice cream (from amongst four possibilities), obtaining the results 
given in the following table. 


Flavor         Count   Proportion
Chocolate        42       0.42
Vanilla          28       0.28
Butterscotch     22       0.22
Strawberry        8       0.08


Coding Chocolate as 1, Vanilla as 2, Butterscotch as 3, and Strawberry as 4, Figure 
5.5.6 presents a bar chart of these data. It is typical for the bars in these charts not to 
touch. 
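
As an illustration only, the proportions behind such a chart can be computed from the counts and displayed as a crude text bar chart. The dictionary name counts and the choice of one '#' per 0.02 of proportion are arbitrary.

```python
counts = {"Chocolate": 42, "Vanilla": 28, "Butterscotch": 22, "Strawberry": 8}
total = sum(counts.values())

# Height of each bar: proportion of the sample choosing that flavor.
proportions = {flavor: count / total for flavor, count in counts.items()}

for flavor, p in proportions.items():
    bar = "#" * round(p * 50)          # one '#' per 0.02 of proportion
    print(f"{flavor:<13}{p:5.2f} {bar}")
```

The flavors are labels only, so the order of the bars carries no numerical meaning.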




[Bar chart omitted; horizontal axis: Flavor, vertical axis: Proportion]


Figure 5.5.6: A bar chart for the data of Example 5.5.5. 


5.5.3 | Types of Inferences 


Certainly quoting descriptive statistics and plotting the data are methods used by a sta- 
tistician to try to learn something about the underlying population distribution. There 
are difficulties with this approach, however, as we have just chosen these methods based 
on intuition. Often it is not clear which descriptive statistics we should use. Furthermore,
these data summaries make no use of the information we have about the true population
distribution as expressed by the statistical model, namely, f_X ∈ {f_θ : θ ∈ Ω}.
Taking account of this information leads us to develop a theory of statistical inference, 
i.e., to specify how we should combine the model information together with the data to 
make inferences about population quantities. We will do this in Chapters 6, 7, and 8, 
but first we discuss the types of inferences that are commonly used in applications. 

In Section 5.2, we discussed three types of inference in the context of a known 
probability model as specified by some density or probability function f. We noted 
that we might want to do any of the following concerning an unobserved response 
value s. 


(i) Predict an unknown response value s via a prediction t. 


(ii) Construct a subset C of the sample space S that has a high probability of containing 
an unknown response value s. 


(iii) Assess whether or not s_0 ∈ S is a plausible value from the probability distribution
specified by f.


We refer to (i), (ii), and (iii) as inferences about the unobserved s. The examples of 
Section 5.2 show that these are intuitively reasonable concepts. 

In an application, we do not know f; we know only that f ∈ {f_θ : θ ∈ Ω}, and we
observe the data s. We are uncertain about which candidate f_θ is correct, or, equivalently,
which of the possible values of θ is correct.

As mentioned in Section 5.5.1, our primary goal may be to determine not the true
f_θ, but some characteristic of the true distribution such as its mean, median, or the
value of the true distribution function F at a specified value. We will denote this
characteristic of interest by ψ(θ). For example, when the characteristic of interest
is the mean of the true distribution of a continuous random variable, then

\psi(\theta) = \int_{-\infty}^{\infty} x f_\theta(x) \, dx.

Alternatively, we might be interested in ψ(θ) = F_θ^{-1}(0.5), the median of the distribution
of a random variable with distribution function given by F_θ.

Different values of θ lead to possibly different values for the characteristic ψ(θ).
After observing the data s, we want to make inferences about what the correct value is.
We will consider the three types of inference for ψ(θ).


(i) Choose an estimate T(s) of ψ(θ), referred to as the problem of estimation.


(ii) Construct a subset C(s) of the set of possible values for ψ(θ) that we believe
contains the true value, referred to as the problem of credible region or confidence
region construction.


(iii) Assess whether or not ψ_0 is a plausible value for ψ(θ) after having observed s,
referred to as the problem of hypothesis assessment.


So estimates, credible or confidence regions, and hypothesis assessment are examples
of types of inference. In particular, we want to construct estimates T(s) of ψ(θ),
construct credible or confidence regions C(s) for ψ(θ), and assess the plausibility of a
hypothesized value ψ_0 for ψ(θ).

The problem of statistical inference entails determining how we should combine
the information in the model {f_θ : θ ∈ Ω} and the data s to carry out these inferences
about ψ(θ).

A very important statistical model for applications is the location-scale normal 
model introduced in Example 5.3.4. We illustrate some of the ideas discussed in this 
section via that model. 


EXAMPLE 5.5.6 Application of the Location-Scale Normal Model 
Suppose the following simple random sample of the heights (in inches) of 30 students 
has been collected. 


64.9  61.4  66.3  64.3  65.1  64.4  59.8  63.6  66.5  65.0
64.9  64.3  62.5  63.1  65.0  65.8  63.4  61.9  66.6  60.9
61.6  64.0  61.5  64.2  66.8  66.4  65.8  71.4  67.8  66.3


The statistician believes that the distribution of heights in the population can be well 
approximated by a normal distribution with some unknown mean and variance, and 
she is unwilling to make any further assumptions about the true distribution. Accordingly,
the statistical model is given by the family of N(μ, σ²) distributions, where
θ = (μ, σ²) ∈ Ω = R¹ × R⁺ is unknown.

Does this statistical model make sense, i.e., is the assumption of normality appropriate
for this situation? The density histogram (based on 12 equal-length intervals
from 59.5 to 71.5) in Figure 5.5.7 looks very roughly normal, but the extreme observation
in the right tail might be some grounds for concern. In any case, we proceed as if
this assumption is reasonable. In Chapter 9, we will discuss more refined methods for
assessing this assumption.


[Histogram omitted; horizontal axis: heights, with ticks at 60, 65, 70]


Figure 5.5.7: Density histogram of heights in Example 5.5.6. 


Suppose we are interested in making inferences about the population mean height,
namely, the characteristic of interest is ψ(μ, σ²) = μ. Alternatively, we might want to
make inferences about the 90th percentile of this distribution, i.e., ψ(μ, σ²) = x_{0.90} =
μ + σ z_{0.90}, where z_{0.90} is the 90th percentile of the N(0, 1) distribution (when X ~
N(μ, σ²), then P(X ≤ μ + σ z_{0.90}) = P((X − μ)/σ ≤ z_{0.90}) = Φ(z_{0.90}) = 0.90).
So 90% of the population under study have height less than x_{0.90}, a value unknown
to us because we do not know the value of (μ, σ²). Obviously, there are many other
characteristics of the true distribution about which we might want to make inferences.

Just using our intuition, T(x_1, ..., x_n) = x̄ seems like a sensible estimate of μ and
T(x_1, ..., x_n) = x̄ + s z_{0.90} seems like a sensible estimate of μ + σ z_{0.90}. To justify the
choice of these estimates, we will need the theories developed in later chapters. In this
case, we obtain x̄ = 64.517, and from (5.5.5) we compute s = 2.379. From Table D.2
we obtain z_{0.90} = 1.2816, so that

\bar{x} + s z_{0.90} = 64.517 + 2.379 (1.2816) = 67.566.
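
These numbers can be reproduced with a short script. This is a sketch only: it uses the standard library's NormalDist in place of Table D.2, and it works with unrounded values of x̄ and s, so the final digit may differ slightly from the hand computation above.

```python
import math
from statistics import NormalDist

heights = [
    64.9, 61.4, 66.3, 64.3, 65.1, 64.4, 59.8, 63.6, 66.5, 65.0,
    64.9, 64.3, 62.5, 63.1, 65.0, 65.8, 63.4, 61.9, 66.6, 60.9,
    61.6, 64.0, 61.5, 64.2, 66.8, 66.4, 65.8, 71.4, 67.8, 66.3,
]
n = len(heights)

xbar = sum(heights) / n                                        # estimate of mu
s = math.sqrt(sum((x - xbar) ** 2 for x in heights) / (n - 1)) # as in (5.5.5)
z90 = NormalDist().inv_cdf(0.90)                               # 90th percentile of N(0, 1)

estimate_90th = xbar + s * z90                                 # estimate of mu + sigma * z_0.90
print(round(xbar, 3), round(s, 3), round(z90, 4), round(estimate_90th, 2))
```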


How accurate is the estimate x̄ of μ? A natural approach to answering this question
is to construct a credible interval, based on the estimate, that we believe has a high
probability of containing the true value of μ and is as short as possible. For example,
the theory in Chapter 6 leads to using confidence intervals for μ, of the form

[\bar{x} - sc, \bar{x} + sc]

for some choice of the constant c. Notice that x̄ is at the center of the interval. The
theory in Chapter 6 will show that, in this case, choosing c = 0.3734 leads to what is
known as a 0.95-confidence interval for μ. We then take the half-length of this interval,
namely,

sc = 2.379 (0.3734) = 0.888,

as a measure of the accuracy of the estimate x̄ = 64.517 of μ. In this case, we have
enough information to say that we know the true value of μ to within one inch, at least
with “confidence” equal to 0.95.

Finally, suppose we have a hypothesized value μ_0 for the population mean height.
For example, we may believe that the mean height of the population of individuals
under study is the same as the mean height of another population for which this quantity
is known to equal μ_0 = 65. Then, based on the observed sample of heights, we want
to assess whether or not the value μ_0 = 65 makes sense. If the sample mean height
x̄ is far from μ_0, this would seem to be evidence against the hypothesized value. In
Chapter 6, we will show that we can base our assessment on the value of

t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{64.517 - 65}{2.379/\sqrt{30}} = -1.112.

If the value of |t| is very large, then we will conclude that we have evidence against
the hypothesized value μ_0 = 65. We have to prescribe what we mean by large here,
and we will do this in Chapter 6. It turns out that t = −1.112 is a plausible value for t,
when the true value of μ equals 65, so we have no evidence against the hypothesis. ■
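
The interval half-length and the value of t can be recomputed from the raw heights. In this sketch the constant c = 0.3734 is simply taken from the text rather than derived, and unrounded values of x̄ and s are used, so t agrees with −1.112 only up to rounding.

```python
import math

heights = [
    64.9, 61.4, 66.3, 64.3, 65.1, 64.4, 59.8, 63.6, 66.5, 65.0,
    64.9, 64.3, 62.5, 63.1, 65.0, 65.8, 63.4, 61.9, 66.6, 60.9,
    61.6, 64.0, 61.5, 64.2, 66.8, 66.4, 65.8, 71.4, 67.8, 66.3,
]
n = len(heights)
xbar = sum(heights) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in heights) / (n - 1))

c = 0.3734                       # constant quoted in the text for n = 30 (not derived here)
half_length = s * c
interval = (xbar - half_length, xbar + half_length)

mu0 = 65.0                       # hypothesized population mean
t = (xbar - mu0) / (s / math.sqrt(n))

print(round(half_length, 3), tuple(round(v, 2) for v in interval), round(t, 2))
```

The hypothesized value 65 falls inside the computed interval, which is consistent with the conclusion that there is no evidence against it.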


Summary of Section 5.5 


• Descriptive statistics represent informal statistical methods that are used to make
inferences about the distribution of a variable X of interest, based on an observed
sample from this distribution. These quantities summarize characteristics of the
observed sample and can be thought of as estimates of the corresponding unknown
population quantities. More formal methods are required to assess the
error in these estimates or even to replace them with estimates having greater
accuracy.


• It is important to plot the data using relevant plots. These give us some idea of
the shape of the population distribution from which we are sampling.


• There are three main types of inference: estimates, credible or confidence intervals,
and hypothesis assessment.


EXERCISES 


5.5.1 Suppose the following data are obtained by recording X, the number of customers
that arrive at an automatic banking machine during 15 successive one-minute
time intervals.

2 1320 T #2 

0 2 3 1 0 0 4 
(a) Record estimates of f_X(0), f_X(1), f_X(2), f_X(3), and f_X(4).
(b) Record estimates of F_X(0), F_X(1), F_X(2), F_X(3), and F_X(4).
(c) Plot \hat{f}_X.
(d) Record the mean and variance. 




(e) Record the median and IQR and provide a boxplot. Using the rule prescribed in 
Example 5.5.4, decide whether there are any outliers. 

5.5.2 Suppose the following sample of waiting times (in minutes) was obtained for 
customers in a queue at an automatic banking machine. 


15 10 2 3 1 0 4 5 

De 3 3 4 2 1 4 5 
(a) Record the empirical distribution function. 
(b) Plot \hat{f}_X.
(c) Record the mean and variance. 
(d) Record the median and IQR and provide a boxplot. Using the rule given in Example 
5.5.4, decide whether there are any outliers. 
5.5.3 Suppose an experiment was conducted to see whether mosquitoes are attracted 
differentially to different colors. Three different colors of fabric were used and the 
number of mosquitoes landing on each piece was recorded over a 15-minute interval. 
The following data were obtained. 


           Number of landings
Color 1            25
Color 2            35
Color 3            22


(a) Record estimates of f_X(1), f_X(2), and f_X(3) where we use i for color i.
(b) Does it make sense to estimate F_X(i)? Explain why or why not.
(c) Plot a bar chart of these data. 


5.5.4 A student is told that his score on a test was at the 90th percentile in the popula- 
tion of all students who took the test. Explain exactly what this means. 


5.5.5 Determine the empirical distribution function based on the sample given below. 


 1.0  -1.2   0.4   1.3  -0.3
-1.4   0.4  -0.5  -0.2  -1.3
 0.0  -1.0  -1.3   2.0   1.0
 0.9   0.4   2.1   0.0  -1.3


Plot this function. Determine the sample median, the first and third quartiles, and the 
interquartile range. What is your estimate of F (1)? 

5.5.6 Consider the density histogram in Figure 5.5.8. If you were asked to record 
measures of location and spread for the data corresponding to this plot, what would 
you choose? Justify your answer. 

5.5.7 Suppose that a statistical model is given by the family of N(μ, σ²) distributions
where θ = μ ∈ R¹ is unknown, while σ² is known. If our interest is in making
inferences about the first quartile of the true distribution, then determine ψ(μ).

5.5.8 Suppose that a statistical model is given by the family of N(μ, σ²) distributions
where θ = μ ∈ R¹ is unknown, while σ² is known. If our interest is in making
inferences about the third moment of the distribution, then determine ψ(μ).




5.5.9 Suppose that a statistical model is given by the family of N(μ, σ²) distributions
where θ = μ ∈ R¹ is unknown, while σ² is known. If our interest is in making
inferences about the distribution function evaluated at 3, then determine ψ(μ).

5.5.10 Suppose that a statistical model is given by the family of N(μ, σ²) distributions
where θ = (μ, σ²) ∈ R¹ × R⁺ is unknown. If our interest is in making inferences
about the first quartile of the true distribution, then determine ψ(μ, σ²).

5.5.11 Suppose that a statistical model is given by the family of N(μ, σ²) distributions
where θ = (μ, σ²) ∈ R¹ × R⁺ is unknown. If our interest is in making inferences
about the distribution function evaluated at 3, then determine ψ(μ, σ²).

5.5.12 Suppose that a statistical model is given by the family of Bernoulli(θ) distributions
where θ ∈ Ω = [0, 1]. If our interest is in making inferences about the probability
that two independent observations from this model are the same, then determine ψ(θ).

5.5.13 Suppose that a statistical model is given by the family of Bernoulli(θ) distributions
where θ ∈ Ω = [0, 1]. If our interest is in making inferences about the probability
that in two independent observations from this model we obtain a 0 and a 1, then
determine ψ(θ).

5.5.14 Suppose that a statistical model is given by the family of Uniform[0, θ] distributions
where θ ∈ Ω = (0, ∞). If our interest is in making inferences about the
coefficient of variation (see Exercise 5.3.5) of the true distribution, then determine
ψ(θ). What do you notice about this characteristic?

5.5.15 Suppose that a statistical model is given by the family of Gamma(α_0, β) distributions
where θ = β ∈ Ω = (0, ∞). If our interest is in making inferences about the
variance of the true distribution, then determine ψ(θ).




Figure 5.5.8: Density histogram for Exercise 5.5.6. 


COMPUTER EXERCISES 


5.5.16 Do the following based on the data in Exercise 5.4.5. 
(a) Compute the order statistics for these data. 
(b) Calculate the empirical distribution function at the data points. 




(c) Calculate the sample mean and the sample standard deviation. 

(d) Obtain the sample median and the sample interquartile range. 

(e) Based on the histograms obtained in Exercise 5.4.5, which set of descriptive statis- 
tics do you feel are appropriate for measuring location and spread? 


(f) Suppose the first data value was recorded incorrectly as 13.9 rather than as 3.9. 
Repeat parts (c) and (d) using this data set and compare your answers with those previ- 
ously obtained. Can you draw any general conclusions about these measures? Justify 
your reasoning. 


5.5.17 Do the following based on the data in Example 5.5.6. 
(a) Compute the order statistics for these data. 
(b) Plot the empirical distribution function (only at the sample points). 


(c) Calculate the sample median and the sample interquartile range and obtain a box- 
plot. Are there any outliers? 


(d) Based on the boxplot, which set of descriptive statistics do you feel is appropriate 
for measuring location and spread? 


(e) Suppose the first data value was recorded incorrectly as 84.9 rather than as 64.9. 
Repeat parts (c) and (d) using this data set and see whether any observations are deter- 
mined to be outliers. 


5.5.18 Generate a sample of 30 from an N (10, 2) distribution and a sample of 1 from 
an N (30, 2) distribution. Combine these together to make a single sample of 31. 


(a) Produce a boxplot of these data. 
(b) What do you notice about this plot? 


(c) Based on the boxplot, what characteristic do you think would be appropriate to 
measure the location and spread of the distribution? Explain why. 


5.5.19 Generate a sample of 50 from a χ²(1) distribution.

(a) Produce a boxplot of these data. 

(b) What do you notice about this plot? 

(c) Based on the boxplot, what characteristic do you think would be appropriate to 
measure the location and spread of the distribution? Explain why. 

5.5.20 Generate a sample of 50 from an N(4, 1) distribution. Suppose your interest is
in estimating the 90th percentile x_{0.9} of this distribution and we pretend that μ = 4 and
σ = 1 are unknown.

(a) Compute an estimate of x_{0.9} based on the appropriate order statistic.

(b) Compute an estimate based on the fact that x_{0.9} = μ + σ z_{0.9} where z_{0.9} is the 90th
percentile of the N(0, 1) distribution.

(c) If you knew, or at least were willing to assume, that the sample came from a normal 
distribution, which of the estimates in parts (a) or (b) would you prefer? Explain why. 


PROBLEMS 


5.5.21 Determine a formula for the sample median, based on interpolation (i.e., using
(5.5.3)) when n is odd. (Hint: Use the least integer function or ceiling ⌈x⌉ = smallest
integer greater than or equal to x.)




5.5.22 An alternative to the empirical distribution function is to define a distribution
function \tilde{F} by \tilde{F}(x) = 0 if x < x_{(1)}, \tilde{F}(x) = 1 if x ≥ x_{(n)}, \tilde{F}(x) = \hat{F}(x_{(i)}) if x = x_{(i)},
and

\tilde{F}(x) = \hat{F}(x_{(i)}) + \frac{\hat{F}(x_{(i+1)}) - \hat{F}(x_{(i)})}{x_{(i+1)} - x_{(i)}} (x - x_{(i)})

if x_{(i)} ≤ x ≤ x_{(i+1)} for i = 1, ..., n.

(a) Show that \tilde{F}(x_{(i)}) = \hat{F}(x_{(i)}) for i = 1, ..., n and that \tilde{F} is increasing from 0 to 1.

(b) Prove that \tilde{F} is continuous on (x_{(1)}, ∞) and right continuous everywhere.

(c) Show that, for p ∈ [1/n, 1), the value \tilde{x}_p defined in (5.5.3) is the solution to
\tilde{F}(\tilde{x}_p) = p.


DISCUSSION TOPICS 


5.5.23 Sometimes it is argued that statistics does not need a formal theory to prescribe 
inferences. Rather, statistical practice is better left to the skilled practitioner to decide 
what is a sensible approach in each problem. Comment on these statements. 

5.5.24 How reasonable do you think it is for an investigator to assume that a random 
variable is normally distributed? Discuss the role of assumptions in scientific mod- 
elling. 


Chapter 6 
Likelihood Inference 


CHAPTER OUTLINE 


Section 1 The Likelihood Function 

Section 2 Maximum Likelihood Estimation 

Section 3 Inferences Based on the MLE 

Section 4 Distribution-Free Methods

Section 5 Large Sample Behavior of the MLE (Advanced) 


In this chapter, we discuss some of the most basic approaches to inference. In essence, 
we want our inferences to depend only on the model {P_θ : θ ∈ Ω} and the data s.
These methods are very minimal in the sense that they require few assumptions. While 
successful for certain problems, it seems that the additional structure of Chapter 7 or 
Chapter 8 is necessary in more involved situations. 

The likelihood function is one of the most basic concepts in statistical inference. 
Entire theories of inference have been constructed based on it. We discuss likeli- 
hood methods in Sections 6.1, 6.2, 6.3, and 6.5. In Section 6.4, we introduce some 
distribution-free methods of inference. These are not really examples of likelihood 
methods, but they follow the same basic idea of having the inferences depend on as 
few assumptions as possible. 


6.1 | The Likelihood Function 


Likelihood inferences are based only on the data s and the model {P_θ : θ ∈ Ω}, the
set of possible probability measures for the system under investigation. From these
ingredients we obtain the basic entity of likelihood inference, namely, the likelihood
function.

To motivate the definition of the likelihood function, suppose we have a statistical
model in which each P_θ is discrete, given by probability function f_θ. Having observed
s, consider the function L(· | s) defined on the parameter space Ω and taking values in
R¹, given by


L(\theta \mid s) = f_\theta(s).




We refer to L(· | s) as the likelihood function determined by the model and the data.
The value L(θ | s) is called the likelihood of θ. Note that for the likelihood function,
we are fixing the data and varying the value of the parameter.

We see that f_θ(s) is just the probability of obtaining the data s when the true value
of the parameter is θ. This imposes a belief ordering on Ω, namely, we believe in θ_1 as
the true value of θ over θ_2 whenever f_{θ_1}(s) > f_{θ_2}(s). This is because the inequality
says that the data are more likely under θ_1 than θ_2. We are indifferent between θ_1 and
θ_2 whenever f_{θ_1}(s) = f_{θ_2}(s). Likelihood inference about θ is based on this ordering.

It is important to remember the correct interpretation of L(θ | s). The value L(θ | s)
is the probability of s given that θ is the true value; it is not the probability of θ given
that we have observed s. Also, it is possible that the value of L(θ | s) is very small for
every value of θ. So it is not the actual value of the likelihood that is telling us how
much support to give to a particular θ, but rather its value relative to the likelihoods of
other possible parameter values.


EXAMPLE 6.1.1 
Suppose S = {1, 2, ...} and that the statistical model is {P_θ : θ ∈ {1, 2}}, where P_1 is
the uniform distribution on the integers {1, ..., 10^3} and P_2 is the uniform distribution
on {1, ..., 10^6}. Further suppose that we observe s = 10. Then L(1 | 10) = 1/10^3
and L(2 | 10) = 1/10^6. Both values are quite small, but note that the likelihood supports
θ = 1 a thousand times more than it supports θ = 2. ■
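
The comparison in this example is a likelihood ratio, which can be computed exactly. In this sketch the function name uniform_likelihood is hypothetical, and the upper limits 10^3 and 10^6 are the ones consistent with the thousand-fold ratio stated above.

```python
from fractions import Fraction

def uniform_likelihood(upper, s):
    """Likelihood of the discrete uniform model on {1, ..., upper} at observation s."""
    return Fraction(1, upper) if 1 <= s <= upper else Fraction(0)

s = 10
L1 = uniform_likelihood(10**3, s)    # theta = 1
L2 = uniform_likelihood(10**6, s)    # theta = 2
ratio = L1 / L2
print(L1, L2, ratio)   # 1/1000 1/1000000 1000
```

Had we observed s = 5000 instead, the likelihood of θ = 1 would be 0, so the data would rule that value out entirely.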


Accordingly, we are only interested in likelihood ratios


\frac{L(\theta_1 \mid s)}{L(\theta_2 \mid s)}


for θ_1, θ_2 ∈ Ω when it comes to determining inferences for θ based on the likelihood
function. This implies that any function that is a positive multiple of L(· | s), i.e.,
L*(· | s) = cL(· | s) for some fixed c > 0, can serve equally well as a likelihood
function. We call two likelihoods equivalent if they are proportional in this way. In
general, we refer to any positive multiple of L(· | s) as a likelihood function.


EXAMPLE 6.1.2 

Suppose that a coin is tossed n = 10 times and that s = 4 heads are observed. With
no knowledge whatsoever concerning the probability of getting a head on a single
toss, the appropriate statistical model for the data is the Binomial(10, θ) model with
θ ∈ Ω = [0, 1]. The likelihood function is given by


L(\theta \mid 4) = \binom{10}{4} \theta^4 (1 - \theta)^6,    (6.1.1)


which is plotted in Figure 6.1.1.

This likelihood peaks at θ = 0.4 and takes the value 0.2508 there. We will examine
uses of the likelihood to estimate the unknown θ and assess the accuracy of the
estimate. Roughly speaking, however, this is based on where the likelihood takes its
maximum and how much spread there is in the likelihood about its peak. ■
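
The location and height of the peak can be confirmed by evaluating (6.1.1) on a grid. This is only a numerical sketch; the grid spacing of 0.001 and the function name binomial_likelihood are arbitrary choices.

```python
from math import comb

def binomial_likelihood(theta, n=10, s=4):
    """Likelihood (6.1.1): probability of s heads in n tosses when P(head) = theta."""
    return comb(n, s) * theta**s * (1 - theta) ** (n - s)

grid = [i / 1000 for i in range(1001)]          # theta = 0.000, 0.001, ..., 1.000
theta_hat = max(grid, key=binomial_likelihood)  # grid point with the largest likelihood
peak = binomial_likelihood(theta_hat)
print(theta_hat, round(peak, 4))   # 0.4 0.2508
```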




[Plot omitted; horizontal axis: theta, from 0.0 to 1.0]


Figure 6.1.1: Likelihood function from the Binomial(10, θ) model when s = 4 is observed.


There is a range of approaches to obtaining inferences via the likelihood function. 
At one extreme is the likelihood principle. 


Likelihood Principle: If two model and data combinations yield equivalent 
likelihood functions, then inferences about the unknown parameter must 
be the same. 


This principle dictates that anything we want to say about the unknown value of θ
must be based only on L(· | s). For many statisticians, this is viewed as a very severe
proscription. Consider the following example.


EXAMPLE 6.1.3 

Suppose a coin is tossed in independent tosses until four heads are obtained and the
number of tails observed until the fourth head is s = 6. Then s is distributed
Negative-Binomial(4, θ), and the likelihood specified by the observed data is


L(\theta \mid 6) = \binom{9}{6} \theta^4 (1 - \theta)^6.


Note that this likelihood function is a positive multiple of (6.1.1).

So the likelihood principle asserts that these two model and data combinations must
yield the same inferences about the unknown θ. In effect, the likelihood principle says
we must ignore the fact that the data were obtained in entirely different ways. If, however,
we take into account additional model features beyond the likelihood function,
then it turns out that we can derive different inferences for the two situations. In particular,
assessing a hypothesized value θ = θ_0 can be carried out in different ways when
the sampling method is taken into account. Many statisticians believe this additional
information should be used when deriving inferences. ■
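
The proportionality of the two likelihoods can be illustrated numerically. The sketch below assumes the Negative-Binomial(4, θ) probability function has the form C(r − 1 + s, s) θ^r (1 − θ)^s; the function names and the test values of θ are illustrative.

```python
from math import comb

def binomial_likelihood(theta):
    """Example 6.1.2: 4 heads observed in 10 tosses."""
    return comb(10, 4) * theta**4 * (1 - theta) ** 6

def negative_binomial_likelihood(theta):
    """Example 6.1.3: 6 tails observed before the 4th head (assumed pmf form)."""
    return comb(9, 6) * theta**4 * (1 - theta) ** 6

thetas = (0.1, 0.25, 0.4, 0.7, 0.9)
ratios = [binomial_likelihood(t) / negative_binomial_likelihood(t) for t in thetas]
print([round(r, 6) for r in ratios])
```

The ratio is the same constant, 210/84 = 2.5, at every θ, so the two likelihoods are equivalent in the sense defined above.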




As an example of an inference derived from a likelihood function, consider a set of
the form


C(s) = \{\theta : L(\theta \mid s) \ge c\},


for some c > 0. The set C(s) is referred to as a likelihood region. It contains all those
θ values for which their likelihood is at least c. A likelihood region, for some c, seems
like a sensible set to quote as possibly containing the true value of θ. For, if θ* ∉ C(s),
then L(θ* | s) < L(θ | s) for every θ ∈ C(s) and so θ* is not as well-supported by the
observed data as any value in C(s). The size of C(s) can then be taken as a measure of
how uncertain we are about the true value of θ.

We are left with the problem, however, of choosing a suitable value for c and, as 
Example 6.1.1 seems to indicate, the likelihood itself does not suggest a natural way to 
do this. In Section 6.3.2, we will discuss a method for choosing c that is based upon 
additional model properties beyond the likelihood function. 

So far in this section, we have assumed that our statistical models are comprised
of discrete distributions. The definition of the likelihood is quite natural, as L(θ | s)
is simply the probability of s occurring when θ is the true value. This interpretation
is clearly not directly available, however, when we have a continuous model because
every data point has probability 0 of occurring. Imagine, however, that f_{θ_1}(s) > f_{θ_2}(s)
and that s ∈ R¹. Then, assuming the continuity of every f_θ at s, we have


P_{\theta_1}(V) = \int_a^b f_{\theta_1}(x) \, dx > \int_a^b f_{\theta_2}(x) \, dx = P_{\theta_2}(V)


for every interval V = (a, b) containing s that is small enough. We interpret this to
mean that the probability of s occurring when θ_1 is true is greater than the probability
of s occurring when θ_2 is true. So the data s support θ_1 more than θ_2. A similar
interpretation applies when s ∈ R^n for n > 1 and V is a region containing s.

Therefore, in the continuous case, we again define the likelihood function by L(θ | s)
= f_θ(s) and interpret the ordering this imposes on the values of θ exactly as we do
in the discrete case.¹ Again, two likelihoods will be considered equivalent if one is a
positive multiple of the other.

Now consider a very important example. 


EXAMPLE 6.1.4 Location Normal Model 

Suppose that (x_1, ..., x_n) is an observed independently and identically distributed
(i.i.d.) sample from an N(θ, σ_0²) distribution where θ ∈ Ω = R¹ is unknown and
σ_0² > 0 is known. The likelihood function is given by


L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f_\theta(x_i) = \prod_{i=1}^{n} (2\pi\sigma_0^2)^{-1/2} \exp\left( -\frac{1}{2\sigma_0^2} (x_i - \theta)^2 \right)


¹Note, however, that whenever we have a situation in which f_{θ_1}(s) = f_{θ_2}(s), we could still have
P_{θ_1}(V) > P_{θ_2}(V) for every V containing s, and small enough. This implies that θ_1 is supported more
than θ_2 rather than these two values having equal support, as implied by the likelihood. This phenomenon
does not occur in the examples we discuss, so we will ignore it here.




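and clearly this simplifies to

$$\begin{aligned}
L(\theta \mid x_1, \ldots, x_n) &= (2\pi\sigma_0^2)^{-n/2} \exp\left(-\frac{1}{2\sigma_0^2}\sum_{i=1}^n (x_i - \theta)^2\right) \\
&= (2\pi\sigma_0^2)^{-n/2} \exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \theta)^2\right) \exp\left(-\frac{n-1}{2\sigma_0^2}s^2\right).
\end{aligned}$$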


An equivalent, simpler version of the likelihood function is then given by

$$L(\theta \mid x_1, \ldots, x_n) = \exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \theta)^2\right),$$

and we will use this version.
For example, suppose $n = 25$, $\sigma_0^2 = 1$, and we observe $\bar{x} = 3.3$. This function is
plotted in Figure 6.1.2.


[Plot of the likelihood against theta, for theta from 2 to 5.]

Figure 6.1.2: Likelihood from a location normal model based on a sample of 25 with $\bar{x} = 3.3$.


The likelihood peaks at $\theta = \bar{x} = 3.3$, and the plotted function takes the value 1
there. The likelihood interval

$$C(x) = \{\theta : L(\theta \mid x_1, \ldots, x_n) \geq 0.5\} = (3.0645, 3.53548)$$

contains all those $\theta$ values whose likelihood is at least 0.5 of the value of the likelihood
at its peak.
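The endpoints of this likelihood interval can be checked numerically. The following is a minimal sketch, assuming the simplified likelihood $\exp(-n(\bar{x}-\theta)^2/(2\sigma_0^2))$ used above; the function names are illustrative only and do not come from the text.

```python
import math

def location_normal_likelihood(theta, xbar, n, sigma0_sq):
    """Simplified location normal likelihood; it equals 1 at theta = xbar."""
    return math.exp(-n * (xbar - theta) ** 2 / (2.0 * sigma0_sq))

def likelihood_interval(xbar, n, sigma0_sq, c):
    """Endpoints of {theta : likelihood >= c} for 0 < c <= 1.

    Solving exp(-n (xbar - theta)^2 / (2 sigma0^2)) = c for theta gives
    theta = xbar -/+ sqrt(-2 sigma0^2 ln(c) / n).
    """
    half_width = math.sqrt(-2.0 * sigma0_sq * math.log(c) / n)
    return xbar - half_width, xbar + half_width

# Values from Example 6.1.4: n = 25, sigma0^2 = 1, xbar = 3.3, cutoff c = 0.5.
lower, upper = likelihood_interval(xbar=3.3, n=25, sigma0_sq=1.0, c=0.5)
print(round(lower, 5), round(upper, 5))
```

Because the simplified likelihood is symmetric about $\bar{x}$, the interval is centered at $\bar{x}$, and a smaller cutoff $c$ gives a wider interval.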

The location normal model is impractical for many applications, as it assumes that 
the variance is known, while the mean is unknown. For example, if we are interested 
in the distribution of heights in a population, it seems unlikely that we will know the 
population variance but not know the population mean. Still, it is an important statis- 
tical model, as it is a context where inference methods can be developed fairly easily. 




The methodology developed for this situation is often used as a paradigm for inference 
methods in much more complicated models. ■


The parameter $\theta$ need not be one-dimensional. The interpretation of the likelihood
is still the same, but it is not possible to plot it, at least not when the dimension of $\theta$
is greater than 2.


EXAMPLE 6.1.5 Multinomial Models 
In Example 2.8.5, we introduced multinomial distributions. These arise in applications
when we have a categorical response variable $s$ that can take a finite number $k$ of values,
say, $\{1, \ldots, k\}$, and $P(s = i) = \theta_i$.

Suppose, then, that $k = 3$ and we do not know the value of $(\theta_1, \theta_2, \theta_3)$. In this
case, the parameter space is given by

$$\Omega = \{(\theta_1, \theta_2, \theta_3) : \theta_i \geq 0, \text{ for } i = 1, 2, 3, \text{ and } \theta_1 + \theta_2 + \theta_3 = 1\}.$$

Notice that it is really only two-dimensional, because as soon as we know the value of
any two of the $\theta_i$'s, say, $\theta_1$ and $\theta_2$, we immediately know the value of the remaining
parameter, as $\theta_3 = 1 - \theta_1 - \theta_2$. This fact should always be remembered when we are
dealing with multinomial models.

Now suppose we observe a sample of $n$ from this distribution, say, $(s_1, \ldots, s_n)$.
The likelihood function for this sample is given by

$$L(\theta_1, \theta_2, \theta_3 \mid s_1, \ldots, s_n) = \theta_1^{x_1}\theta_2^{x_2}\theta_3^{x_3}, \tag{6.1.2}$$

where $x_i$ is the number of $i$'s in the sample.

Using the fact that we can treat positive multiples of the likelihood as being equivalent,
we see that the likelihood based on the observed counts $(x_1, x_2, x_3)$ (since they
arise from a Multinomial$(n, \theta_1, \theta_2, \theta_3)$ distribution) is given by

$$L(\theta_1, \theta_2, \theta_3 \mid x_1, x_2, x_3) = \theta_1^{x_1}\theta_2^{x_2}\theta_3^{x_3}.$$

This is identical to the likelihood (as functions of $\theta_1$, $\theta_2$, and $\theta_3$) for the original sample.
It is certainly simpler to deal with the counts rather than the original sample. This
is a very important phenomenon in statistics and is characterized by the concept of
sufficiency, discussed in the next section. ■
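The equivalence of the two likelihoods can be illustrated numerically. The sketch below uses a made-up sample and made-up parameter values (both are assumptions for illustration) and checks that the Multinomial probability of the counts is a constant multiple of the likelihood of the ordered sample, whatever the parameter value.

```python
from math import factorial

def sample_likelihood(theta, sample):
    """Likelihood of the ordered sample: product of theta_i over the observations."""
    value = 1.0
    for s in sample:
        value *= theta[s - 1]
    return value

def multinomial_pmf(theta, counts):
    """Multinomial(n, theta) probability of the observed counts."""
    coef = factorial(sum(counts))
    for x in counts:
        coef //= factorial(x)          # exact integer multinomial coefficient
    value = float(coef)
    for t, x in zip(theta, counts):
        value *= t ** x
    return value

sample = [1, 3, 2, 1, 1, 2]                        # hypothetical data, n = 6
counts = [sample.count(i) for i in (1, 2, 3)]      # (x1, x2, x3) = (3, 2, 1)
thetas = [(0.2, 0.3, 0.5), (0.5, 0.25, 0.25), (0.6, 0.3, 0.1)]

# The ratio is the multinomial coefficient and does not depend on theta.
ratios = [multinomial_pmf(t, counts) / sample_likelihood(t, sample) for t in thetas]
print(ratios)
```

Since the ratio is the same for every parameter value tried, the two likelihoods give the same likelihood ratios, which is all that matters for inference.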


6.1.1 | Sufficient Statistics 


The equivalence for inference of positive multiples of the likelihood function leads to
a useful equivalence amongst possible data values coming from the same model. For
example, suppose data values $s_1$ and $s_2$ are such that $L(\cdot \mid s_1) = cL(\cdot \mid s_2)$ for some
$c > 0$. From the point of view of likelihood, we are indifferent as to whether we
obtained the data $s_1$ or the data $s_2$, as they lead to the same likelihood ratios.

This leads to the definition of a sufficient statistic. 




Definition 6.1.1 A function $T$ defined on the sample space $S$ is called a sufficient
statistic for the model if, whenever $T(s_1) = T(s_2)$, then

$$L(\cdot \mid s_1) = c(s_1, s_2)\,L(\cdot \mid s_2)$$

for some constant $c(s_1, s_2) > 0$.


The terminology is motivated by the fact that we need only observe the value $t$ for the
function $T$, as we can pick any value

$$s \in T^{-1}\{t\} = \{s : T(s) = t\}$$

and use the likelihood based on $s$. All of these choices give the same likelihood ratios.
Typically, $T(s)$ will be of lower dimension than $s$, so we can consider replacing $s$ by
$T(s)$ as a data reduction which simplifies the analysis somewhat.

We illustrate the computation of a sufficient statistic in a simple context. 


EXAMPLE 6.1.6 


Suppose that $S = \{1, 2, 3, 4\}$, $\Omega = \{a, b\}$, and the two probability distributions are
given by the following table.

             | s = 1 | s = 2 | s = 3 | s = 4
$\theta = a$ | 1/2   | 1/6   | 1/6   | 1/6
$\theta = b$ | 1/4   | 1/4   | 1/4   | 1/4

Then $L(\cdot \mid 2) = L(\cdot \mid 3) = L(\cdot \mid 4)$ (e.g., $L(a \mid 2) = 1/6$ and $L(b \mid 2) = 1/4$), so the
data values in $\{2, 3, 4\}$ all give the same likelihood ratios. Therefore, $T : S \to \{0, 1\}$
given by $T(1) = 0$ and $T(2) = T(3) = T(4) = 1$ is a sufficient statistic. The model
has simplified a bit, as now the sample space for $T$ has only two elements instead of
four for the original model. ■
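The grouping of data values in this example can be reproduced mechanically. The following sketch groups data values whose likelihood functions are positive multiples of one another, by rescaling each likelihood to sum to 1 over the parameter values. This rescaling is one convenient choice made here for illustration, not a device used in the text, and the function names are hypothetical.

```python
from fractions import Fraction as F

# The model of Example 6.1.6: model[theta][s] = f_theta(s).
model = {
    "a": {1: F(1, 2), 2: F(1, 6), 3: F(1, 6), 4: F(1, 6)},
    "b": {1: F(1, 4), 2: F(1, 4), 3: F(1, 4), 4: F(1, 4)},
}

def normalized_likelihood(model, s):
    """Likelihood function at data s, rescaled to sum to 1 over theta.

    Two data values have proportional likelihoods exactly when these
    rescaled likelihoods are equal (assumes the sum is positive).
    """
    values = [model[theta][s] for theta in sorted(model)]
    total = sum(values)
    return tuple(v / total for v in values)

def likelihood_classes(model, sample_space):
    """Group data values that give the same likelihood ratios."""
    classes = {}
    for s in sample_space:
        classes.setdefault(normalized_likelihood(model, s), []).append(s)
    return sorted(classes.values())

classes = likelihood_classes(model, [1, 2, 3, 4])
print(classes)
```

Any function $T$ that is constant on each group and takes different values on different groups, such as the $T$ of the example, is then a sufficient statistic.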


The following result helps identify sufficient statistics. 


Theorem 6.1.1 (Factorization theorem) If the density (or probability function) for
a model factors as $f_\theta(s) = h(s)g_\theta(T(s))$, where $g_\theta$ and $h$ are nonnegative, then $T$
is a sufficient statistic.


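PROOF By hypothesis, it is clear that, when $T(s_1) = T(s_2)$, we have

$$\begin{aligned}
L(\cdot \mid s_1) &= h(s_1)g_\theta(T(s_1)) = \frac{h(s_1)g_\theta(T(s_1))}{h(s_2)g_\theta(T(s_2))}\,h(s_2)g_\theta(T(s_2)) \\
&= \frac{h(s_1)}{h(s_2)}\,h(s_2)g_\theta(T(s_2)) = c(s_1, s_2)\,L(\cdot \mid s_2)
\end{aligned}$$

because $g_\theta(T(s_1)) = g_\theta(T(s_2))$. ■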


Note that the name of this result is motivated by the fact that we have factored $f_\theta$
as a product of two functions. The important point about a sufficient statistic $T$ is that
we are indifferent, at least when considering inferences about $\theta$, between observing the
full data $s$ or the value of $T(s)$. We will see in Chapter 9 that there is information in
the data, beyond the value of $T(s)$, that is useful when we want to check assumptions.




Minimal Sufficient Statistics 


Given that a sufficient statistic makes a reduction in the data, without losing relevant
information in the data for inferences about $\theta$, we look for a sufficient statistic that
makes the greatest reduction. Such a statistic is called a minimal sufficient statistic.


Definition 6.1.2 A sufficient statistic $T$ for a model is a minimal sufficient statistic
whenever the value of $T(s)$ can be calculated once we know the likelihood function
$L(\cdot \mid s)$.


So a relevant likelihood function can always be obtained from the value of any sufficient
statistic $T$, but if $T$ is minimal sufficient as well, then we can also obtain the value
of $T$ from any likelihood function. It can be shown that a minimal sufficient statistic
gives the greatest reduction of the data in the sense that, if $T$ is minimal sufficient and
$U$ is sufficient, then there is a function $h$ such that $T = h(U)$. Note that the definitions
of sufficient statistic and minimal sufficient statistic depend on the model, i.e., different
models can give rise to different sufficient and minimal sufficient statistics.

While the idea of a minimal sufficient statistic is a bit subtle, it is usually quite 
simple to find one, as the following examples illustrate. 


EXAMPLE 6.1.7 Location Normal Model 
By the factorization theorem we see immediately, from the discussion in Example
6.1.4, that $\bar{x}$ is a sufficient statistic. Now any likelihood function for this model is a
positive multiple of

$$\exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \theta)^2\right).$$

Notice that any such function of $\theta$ is completely specified by the point where it takes
its maximum, namely, at $\theta = \bar{x}$. So we have that $\bar{x}$ can be obtained from any likelihood
function for this model, and it is therefore a minimal sufficient statistic. ■


EXAMPLE 6.1.8 Location-Scale Normal Model 
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution in which $\mu \in R^1$
and $\sigma > 0$ are unknown. Recall the discussion and application of this model in
Examples 5.3.4 and 5.5.6.

The parameter in this model is two-dimensional and is given by $\theta = (\mu, \sigma^2) \in
\Omega = R^1 \times (0, \infty)$. Therefore, the likelihood function is given by

$$\begin{aligned}
L(\theta \mid x_1, \ldots, x_n) &= (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right) \\
&= (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2\right) \exp\left(-\frac{n-1}{2\sigma^2}s^2\right).
\end{aligned}$$

We see immediately, from the factorization theorem, that $(\bar{x}, s^2)$ is a sufficient statistic.
Now, fixing $\sigma^2$, any positive multiple of $L(\cdot \mid x_1, \ldots, x_n)$ is maximized, as a function
of $\mu$, at $\mu = \bar{x}$. This is independent of $\sigma^2$. Fixing $\mu$ at $\bar{x}$, we have that

$$L((\bar{x}, \sigma^2) \mid x_1, \ldots, x_n) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{n-1}{2\sigma^2}s^2\right)$$




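is maximized, as a function of $\sigma^2$, at the same point as $\ln L((\bar{x}, \sigma^2) \mid x_1, \ldots, x_n)$
because $\ln$ is a strictly increasing function. Now

$$\frac{\partial \ln L((\bar{x}, \sigma^2) \mid x)}{\partial \sigma^2} = \frac{\partial}{\partial \sigma^2}\left(-\frac{n}{2}\ln \sigma^2 - \frac{n-1}{2\sigma^2}s^2\right) = -\frac{n}{2\sigma^2} + \frac{n-1}{2\sigma^4}s^2.$$

Setting this equal to 0 yields the solution

$$\hat{\sigma}^2 = \frac{n-1}{n}s^2,$$

which is a 1-1 function of $s^2$. So, given any likelihood function for this model, we can
compute $(\bar{x}, s^2)$, which establishes that $(\bar{x}, s^2)$ is a minimal sufficient statistic for the
model. In fact, the likelihood is maximized at $(\bar{x}, \hat{\sigma}^2)$ (Problem 6.1.22). ■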


EXAMPLE 6.1.9 Multinomial Models 
We saw in Example 6.1.5 that the likelihood function for a sample is given by (6.1.2).
This makes clear that if two different samples have the same counts, then they have the
same likelihood, so the counts $(x_1, x_2, x_3)$ comprise a sufficient statistic.
Now it turns out that this likelihood function is maximized by taking

$$(\theta_1, \theta_2, \theta_3) = \left(\frac{x_1}{n}, \frac{x_2}{n}, \frac{x_3}{n}\right).$$

So, given the likelihood, we can compute the counts (the sample size $n$ is assumed
known). Therefore, $(x_1, x_2, x_3)$ is a minimal sufficient statistic. ■


Summary of Section 6.1 


• The likelihood function for a model and data shows how the data support the
various possible values of the parameter. It is not the actual value of the likelihood
that is important but the ratios of the likelihood at different values of the
parameter.

• A sufficient statistic $T$ for a model is any function of the data $s$ such that once we
know the value of $T(s)$, then we can determine the likelihood function $L(\cdot \mid s)$
(up to a positive constant multiple).

• A minimal sufficient statistic $T$ for a model is any sufficient statistic such that
once we know a likelihood function $L(\cdot \mid s)$ for the model and data, then we can
determine $T(s)$.


EXERCISES 


6.1.1 Suppose a sample of n individuals is being tested for the presence of an antibody 
in their blood and that the number with the antibody present is recorded. Record an 
appropriate statistical model for this situation when we assume that the responses from 




individuals are independent. If we have a sample of 10 and record 3 positives, graph a 
representative likelihood function. 
6.1.2 Suppose that suicides occur in a population at a rate p per person year and that 
p is assumed completely unknown. If we model the number of suicides observed in a 
population with a total of N person years as Poisson(Np), then record a representative 
likelihood function for p when we observe 22 suicides with N = 30,345.
6.1.3 Suppose that the lifelengths (in thousands of hours) of light bulbs are distributed
Exponential($\theta$), where $\theta > 0$ is unknown. If we observe $\bar{x} = 5.2$ for a sample of 20
light bulbs, record a representative likelihood function. Why is it that we only need to
observe the sample average to obtain a representative likelihood?
6.1.4 Suppose we take a sample of $n = 100$ students from a university with over
50,000 students enrolled. We classify these students as either living on campus, living
off campus with their parents, or living off campus independently. Suppose we observe
the counts $(x_1, x_2, x_3) = (34, 44, 22)$. Determine the form of the likelihood function
for the unknown proportions of students in the population that are in these categories.
6.1.5 Determine the constant that makes the likelihood functions in Examples 6.1.2 
and 6.1.3 equal. 
6.1.6 Suppose that $(x_1, \ldots, x_n)$ is a sample from the Bernoulli($\theta$) distribution, where
$\theta \in [0, 1]$ is unknown. Determine the likelihood function and a minimal sufficient
statistic for this model. (Hint: Use the factorization theorem and maximize the logarithm
of the likelihood function.)
6.1.7 Suppose $(x_1, \ldots, x_n)$ is a sample from the Poisson($\theta$) distribution where $\theta > 0$
is unknown. Determine the likelihood function and a minimal sufficient statistic for
this model. (Hint: Use the factorization theorem and maximize the logarithm of the
likelihood function.)
6.1.8 Suppose that a statistical model is comprised of two distributions given by the
following table:

         | s = 1 | s = 2 | s = 3
$f_1(s)$ | 0.3   | 0.1   | 0.6
$f_2(s)$ | 0.1   | 0.7   | 0.2

(a) Plot the likelihood function for each possible data value $s$.

(b) Find a sufficient statistic that makes a reduction in the data.

6.1.9 Suppose a statistical model is given by $\{f_1, f_2\}$, where $f_i$ is an $N(i, 1)$ distribution.
Compute the likelihood ratio $L(1 \mid 0)/L(2 \mid 0)$ and explain how you interpret this
number.

6.1.10 Explain why a likelihood function can never take negative values. Can a likeli- 
hood function be equal to 0 at a parameter value? 

6.1.11 Suppose we have a statistical model $\{f_\theta : \theta \in [0, 1]\}$ and we observe $x_0$. Is it
true that $\int_0^1 L(\theta \mid x_0)\,d\theta = 1$? Explain why or why not.

6.1.12 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Geometric($\theta$) distribution, where
$\theta \in [0, 1]$ is unknown. Determine the likelihood function and a minimal sufficient
statistic for this model. (Hint: Use the factorization theorem and maximize the logarithm
of the likelihood.)




6.1.13 Suppose you are told that the likelihood of a particular parameter value is $10^9$.
Is it possible to interpret this number in any meaningful way? Explain why or why not.
6.1.14 Suppose one statistician records a likelihood function as $\theta^2$ for $\theta \in [0, 1]$ while
another statistician records a likelihood function as $100\theta^2$ for $\theta \in [0, 1]$. Explain why
these likelihood functions are effectively the same.


PROBLEMS 


6.1.15 Show that T defined in Example 6.1.6 is a minimal sufficient statistic. (Hint: 
Show that once you know the likelihood function, you can determine which of the two 
possible values for T has occurred.) 

6.1.16 Suppose that $S = \{1, 2, 3, 4\}$, $\Omega = \{a, b, c\}$, where the three probability
distributions are given by the following table.


1/2 1/6 1/6 


1/4 1/4 1/4 
1/2 1/4 1/4 


Determine a minimal sufficient statistic for this model. Is the minimal sufficient statis- 
tic in Example 6.1.6 sufficient for this model? 

6.1.17 Suppose that $(x_1, \ldots, x_n)$ is a sample from the $N(\mu, \sigma_0^2)$ distribution where
$\mu \in R^1$ is unknown. Determine the form of likelihood intervals for this model.

6.1.18 Suppose that $(x_1, \ldots, x_n) \in R^n$ is a sample from $f_\theta$, where $\theta \in \Omega$ is unknown.
Show that the order statistics $(x_{(1)}, \ldots, x_{(n)})$ comprise a sufficient statistic for
the model.

6.1.19 Determine a minimal sufficient statistic for a sample of $n$ from the rate gamma
model, i.e.,

$$f_\theta(x) = \frac{\theta^{\alpha_0}}{\Gamma(\alpha_0)}x^{\alpha_0 - 1}\exp\{-\theta x\}$$

for $x > 0$, $\theta > 0$ and where $\alpha_0 > 0$ is fixed.


6.1.20 Determine the form of a minimal sufficient statistic for a sample of size $n$ from
the Uniform$[0, \theta]$ model where $\theta > 0$.

6.1.21 Determine the form of a minimal sufficient statistic for a sample of size $n$ from
the Uniform$[\theta_1, \theta_2]$ model where $\theta_1 < \theta_2$.

6.1.22 For the location-scale normal model, establish that the point where the likelihood
is maximized is given by $(\bar{x}, \hat{\sigma}^2)$ as defined in Example 6.1.8. (Hint: Show that
the second derivative of $\ln L((\bar{x}, \sigma^2) \mid x)$, with respect to $\sigma^2$, is negative at $\hat{\sigma}^2$ and then
argue that $(\bar{x}, \hat{\sigma}^2)$ is the maximum.)

6.1.23 Suppose we have a sample of $n$ from a Bernoulli($\theta$) distribution where $\theta \in
[0, 0.5]$. Determine a minimal sufficient statistic for this model. (Hint: It is easy to
establish the sufficiency of $\bar{x}$, but this point will not maximize the likelihood when
$\bar{x} > 0.5$, so $\bar{x}$ cannot be obtained from the likelihood by maximization, as in Exercise
6.1.6. In general, consider the second derivative of the log of the likelihood at any point


308 Section 6.2: Maximum Likelihood 


$\theta \in (0, 0.5)$ and note that knowing the likelihood means that we can compute any of
its derivatives at any values where these exist.)


6.1.24 Suppose we have a sample of $n$ from the Multinomial$(1, \theta, 2\theta, 1 - 3\theta)$ distribution,
where $\theta \in [0, 1/3]$ is unknown. Determine the form of the likelihood function
and show that $x_1 + x_2$ is a minimal sufficient statistic where $x_i$ is the number of sample
values corresponding to an observation in the $i$th category. (Hint: Problem 6.1.23.)


6.1.25 Suppose we observe $s$ from a statistical model with two densities, $f_1$ and $f_2$.
Show that the likelihood ratio $T(s) = f_1(s)/f_2(s)$ is a minimal sufficient statistic.
(Hint: Use the definition of sufficiency directly.)


CHALLENGES 


6.1.26 Consider the location-scale gamma model, i.e.,

$$f_{(\mu, \sigma)}(x) = \frac{1}{\sigma\Gamma(\alpha_0)}\left(\frac{x - \mu}{\sigma}\right)^{\alpha_0 - 1}\exp\left\{-\frac{x - \mu}{\sigma}\right\}$$

for $x > \mu \in R^1$, $\sigma > 0$ and where $\alpha_0 > 0$ is fixed.

(a) Determine the minimal sufficient statistic for a sample of $n$ when $\alpha_0 = 1$. (Hint:
Determine where the likelihood is positive and calculate the partial derivative of the
log of the likelihood with respect to $\mu$.)

(b) Determine the minimal sufficient statistic for a sample of $n$ when $\alpha_0 \neq 1$. (Hint:
Use Problem 6.1.18, the partial derivative of the log of the likelihood with respect to
$\mu$, and determine where it is infinite.)


DISCUSSION TOPICS 


6.1.27 How important do you think it is for a statistician to try to quantify how much 
error there is in an inference drawn? For example, if an estimate is being quoted for 
some unknown quantity, is it important that the statistician give some indication about 
how accurate (or inaccurate) this inference is? 


6.2 | Maximum Likelihood Estimation 


In Section 6.1, we introduced the likelihood function $L(\cdot \mid s)$ as a basis for making
inferences about the unknown true value $\theta \in \Omega$. We now begin to consider the specific
types of inferences discussed in Section 5.5.3 and start with estimation.

When we are interested in a point estimate of $\theta$, then a value $\hat{\theta}(s)$ that maximizes
$L(\theta \mid s)$ is a sensible choice, as this value is the best supported by the data, i.e.,

$$L(\hat{\theta}(s) \mid s) \geq L(\theta \mid s) \tag{6.2.1}$$

for every $\theta \in \Omega$.


Definition 6.2.1 We call $\hat{\theta} : S \to \Omega$ satisfying (6.2.1) for every $\theta \in \Omega$ a maximum
likelihood estimator, and the value $\hat{\theta}(s)$ is called a maximum likelihood estimate,
or MLE for short.




Notice that, if we use $cL(\cdot \mid s)$ as the likelihood function, for fixed $c > 0$, then $\hat{\theta}(s)$
is also an MLE using this version of the likelihood. So we can use any version of the
likelihood to calculate an MLE.


EXAMPLE 6.2.1 
Suppose the sample space is $S = \{1, 2, 3\}$, the parameter space is $\Omega = \{1, 2\}$, and the
model is given by the following table.

         | s = 1 | s = 2 | s = 3
$f_1(s)$ | 0.3   | 0.4   | 0.3
$f_2(s)$ | 0.1   | 0.7   | 0.2

Further suppose we observe $s = 1$. So, for example, we could be presented with one of
two bowls of chips containing these proportions of chips labeled 1, 2, and 3. We draw
a chip, observe that it is labelled 1, and now want to make inferences about which bowl
we have been presented with.

In this case, the MLE is given by $\hat{\theta}(1) = 1$, since $0.3 = L(1 \mid 1) > 0.1 = L(2 \mid 1)$.
If we had instead observed $s = 2$, then $\hat{\theta}(2) = 2$; if we had observed $s = 3$, then
$\hat{\theta}(3) = 1$. ■


Note that an MLE need not be unique. For example, in Example 6.2.1, if $f_2$ was
defined by $f_2(1) = 0$, $f_2(2) = 0.7$ and $f_2(3) = 0.3$, then an MLE is as given there,
but putting $\hat{\theta}(3) = 2$ also gives an MLE.
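For a finite model such as this, an MLE can be found by direct comparison of the table entries. The sketch below returns every maximizing parameter value, so that ties are visible; the function name `all_mles` and the dictionary layout are illustrative choices, not notation from the text.

```python
def all_mles(model, s):
    """Return every parameter value whose likelihood at data s is maximal."""
    best = max(dist[s] for dist in model.values())
    return [theta for theta, dist in model.items() if dist[s] == best]

# model[theta][s] = f_theta(s), as in the table of Example 6.2.1.
example = {1: {1: 0.3, 2: 0.4, 3: 0.3},
           2: {1: 0.1, 2: 0.7, 3: 0.2}}

# The modified f_2 with f_2(1) = 0, f_2(2) = 0.7, f_2(3) = 0.3.
modified = {1: {1: 0.3, 2: 0.4, 3: 0.3},
            2: {1: 0.0, 2: 0.7, 3: 0.3}}

table_mles = {s: all_mles(example, s) for s in (1, 2, 3)}
tied_mles = all_mles(modified, 3)
print(table_mles, tied_mles)
```

In the modified model both parameter values have likelihood 0.3 at $s = 3$, so either may be reported as the MLE there.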

The MLE has a very important invariance property. Suppose we reparameterize a
model via a 1-1 function $\Psi$ defined on $\Omega$. By this we mean that, instead of labelling the
individual distributions in the model using $\theta \in \Omega$, we use $\psi \in \Upsilon = \{\Psi(\theta) : \theta \in \Omega\}$.
For example, in Example 6.2.1, we could take $\Psi(1) = a$ and $\Psi(2) = b$ so that $\Upsilon =
\{a, b\}$. So the model is now given by $\{g_\psi : \psi \in \Upsilon\}$, where $g_\psi = f_\theta$ for the unique
value $\theta$ such that $\Psi(\theta) = \psi$. We have a new parameter $\psi$ and a new parameter space
$\Upsilon$. Nothing has changed about the probability distributions in the statistical model,
only the way they are labelled. We then have the following result.


Theorem 6.2.1 If $\hat{\theta}(s)$ is an MLE for the original parameterization and, if $\Psi$ is a
1-1 function defined on $\Omega$, then $\hat{\psi}(s) = \Psi(\hat{\theta}(s))$ is an MLE in the new parameterization.


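PROOF If we select the likelihood function for the new parameterization to be
$L^*(\psi \mid s) = g_\psi(s)$, and the likelihood for the original parameterization to be
$L(\theta \mid s) = f_\theta(s)$, then we have

$$L^*(\hat{\psi}(s) \mid s) = g_{\Psi(\hat{\theta}(s))}(s) = f_{\hat{\theta}(s)}(s) = L(\hat{\theta}(s) \mid s) \geq L(\theta \mid s) = L^*(\Psi(\theta) \mid s)$$

for every $\theta \in \Omega$. This implies that $L^*(\hat{\psi}(s) \mid s) \geq L^*(\psi \mid s)$ for every $\psi \in \Upsilon$ and
establishes the result. ■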


Theorem 6.2.1 shows that no matter how we parameterize the model, the MLE behaves 
in a consistent way under the reparameterization. This is an important property, and 
not all estimation procedures satisfy this. 
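The invariance property can be checked directly on the model of Example 6.2.1 with the relabelling $\Psi(1) = a$, $\Psi(2) = b$ mentioned above. This is only a sketch; the helper `argmax_label` is a hypothetical name, and it returns a single maximizer, which is adequate here because no ties occur in this table.

```python
# Original parameterization: f[theta][s] = f_theta(s) from Example 6.2.1.
f = {1: {1: 0.3, 2: 0.4, 3: 0.3},
     2: {1: 0.1, 2: 0.7, 3: 0.2}}

# The 1-1 relabelling Psi(1) = a, Psi(2) = b.
Psi = {1: "a", 2: "b"}

# New parameterization: g_psi = f_theta for the unique theta with Psi(theta) = psi.
g = {Psi[theta]: dist for theta, dist in f.items()}

def argmax_label(model, s):
    """Label of a distribution maximizing the likelihood at data s."""
    return max(model, key=lambda label: model[label][s])

# MLE computed in the new parameterization versus Psi applied to the original MLE.
direct = {s: argmax_label(g, s) for s in (1, 2, 3)}
via_theorem = {s: Psi[argmax_label(f, s)] for s in (1, 2, 3)}
print(direct, via_theorem)
```

The two dictionaries agree, as Theorem 6.2.1 says they must.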




6.2.1 | Computation of the MLE 


An important issue is the computation of MLEs. In Example 6.2.1, we were able to
do this by simply examining the table giving the distributions. With more complicated
models, this approach is not possible. In many situations, however, we can use the
methods of calculus to compute $\hat{\theta}(s)$. For this we require that $f_\theta(s)$ be a continuously
differentiable function of $\theta$ so that we can use optimization methods from calculus.

Rather than using the likelihood function, it is often convenient to use the
log-likelihood function.

Definition 6.2.2 For likelihood function $L(\cdot \mid s)$, the log-likelihood function $l(\cdot \mid s)$
defined on $\Omega$, is given by $l(\cdot \mid s) = \ln L(\cdot \mid s)$.


Note that $\ln(x)$ is a 1-1 increasing function of $x > 0$ and this implies that $L(\hat{\theta}(s) \mid s) \geq
L(\theta \mid s)$ for every $\theta \in \Omega$ if and only if $l(\hat{\theta}(s) \mid s) \geq l(\theta \mid s)$ for every $\theta \in \Omega$. So we
can maximize $l(\cdot \mid s)$ instead when computing an MLE. The convenience of the
log-likelihood arises from the fact that, for a sample $(s_1, \ldots, s_n)$ from $\{f_\theta : \theta \in \Omega\}$, the
likelihood function is given by

$$L(\theta \mid s_1, \ldots, s_n) = \prod_{i=1}^n f_\theta(s_i)$$

whereas the log-likelihood is given by

$$l(\theta \mid s_1, \ldots, s_n) = \sum_{i=1}^n \ln f_\theta(s_i).$$

It is typically much easier to differentiate a sum than a product.

Because we are going to be differentiating the log-likelihood, it is convenient to
give a name to this derivative. We define the score function $S(\theta \mid s)$ of a model to
be the derivative of its log-likelihood function whenever this exists. So when $\theta$ is a
one-dimensional real-valued parameter, then

$$S(\theta \mid s) = \frac{\partial l(\theta \mid s)}{\partial \theta},$$

provided this partial derivative exists (see Appendix A.5 for a definition of partial
derivative). We restrict our attention now to the situation in which $\theta$ is one-dimensional.
To obtain the MLE, we must then solve the score equation

$$S(\theta \mid s) = 0 \tag{6.2.2}$$


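for $\theta$. Of course, a solution to (6.2.2) is not necessarily an MLE, because such a point
may be a local minimum or only a local maximum rather than a global maximum. To
guarantee that a solution $\hat{\theta}(s)$ is at least a local maximum, we must also check that

$$\left.\frac{\partial S(\theta \mid s)}{\partial \theta}\right|_{\theta = \hat{\theta}(s)} = \left.\frac{\partial^2 l(\theta \mid s)}{\partial \theta^2}\right|_{\theta = \hat{\theta}(s)} < 0. \tag{6.2.3}$$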




Then we must evaluate $l(\cdot \mid s)$ at each local maximum in order to determine the global
maximum.
Let us compute some MLEs using calculus. 


EXAMPLE 6.2.2 Location Normal Model 
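Consider the likelihood function

$$L(\theta \mid x_1, \ldots, x_n) = \exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \theta)^2\right),$$

obtained in Example 6.1.4 for a sample $(x_1, \ldots, x_n)$ from the $N(\theta, \sigma_0^2)$ model where
$\theta \in R^1$ is unknown and $\sigma_0^2$ is known. The log-likelihood function is then

$$l(\theta \mid x_1, \ldots, x_n) = -\frac{n}{2\sigma_0^2}(\bar{x} - \theta)^2,$$

and the score function is

$$S(\theta \mid x_1, \ldots, x_n) = \frac{n}{\sigma_0^2}(\bar{x} - \theta).$$

The score equation is given by

$$\frac{n}{\sigma_0^2}(\bar{x} - \theta) = 0.$$

Solving this for $\theta$ gives the unique solution $\hat{\theta}(x_1, \ldots, x_n) = \bar{x}$. To check that this is a
local maximum, we calculate

$$\left.\frac{\partial S(\theta \mid x_1, \ldots, x_n)}{\partial \theta}\right|_{\theta = \bar{x}} = -\frac{n}{\sigma_0^2},$$

which is negative, and thus indicates that $\bar{x}$ is a local maximum. Because we have only
one local maximum, it is also the global maximum and we have indeed obtained the
MLE. ■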


EXAMPLE 6.2.3 Exponential Model 
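Suppose that a lifetime is known to be distributed Exponential$(1/\theta)$, where $\theta > 0$ is
unknown. Then based on a sample $(x_1, \ldots, x_n)$, the likelihood is given by

$$L(\theta \mid x_1, \ldots, x_n) = \frac{1}{\theta^n}\exp\left(-\frac{n\bar{x}}{\theta}\right),$$

the log-likelihood is given by

$$l(\theta \mid x_1, \ldots, x_n) = -n\ln\theta - \frac{n\bar{x}}{\theta},$$

and the score function is given by

$$S(\theta \mid x_1, \ldots, x_n) = -\frac{n}{\theta} + \frac{n\bar{x}}{\theta^2}.$$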




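Solving the score equation gives $\hat{\theta}(x_1, \ldots, x_n) = \bar{x}$, and because $\bar{x} > 0$,

$$\left.\frac{\partial S(\theta \mid x_1, \ldots, x_n)}{\partial \theta}\right|_{\theta = \bar{x}} = \left.\left(\frac{n}{\theta^2} - \frac{2n\bar{x}}{\theta^3}\right)\right|_{\theta = \bar{x}} = -\frac{n}{\bar{x}^2} < 0,$$

so $\bar{x}$ is indeed the MLE. ■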
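The calculus in this example can be checked numerically. The following sketch codes the log-likelihood, score, and derivative of the score as given above, and uses hypothetical values $n = 10$ and $\bar{x} = 2.0$ (assumptions chosen only for illustration).

```python
import math

def log_likelihood(theta, xbar, n):
    """l(theta) = -n ln(theta) - n xbar / theta for the Exponential(1/theta) model."""
    return -n * math.log(theta) - n * xbar / theta

def score(theta, xbar, n):
    """S(theta) = -n / theta + n xbar / theta^2."""
    return -n / theta + n * xbar / theta ** 2

def score_derivative(theta, xbar, n):
    """dS/dtheta = n / theta^2 - 2 n xbar / theta^3."""
    return n / theta ** 2 - 2.0 * n * xbar / theta ** 3

n, xbar = 10, 2.0            # hypothetical sample size and sample mean
theta_hat = xbar             # solution of the score equation

# A coarse grid check of the defining property (6.2.1).
grid = [0.5 + 0.1 * k for k in range(200)]
is_global_max = all(
    log_likelihood(theta_hat, xbar, n) >= log_likelihood(t, xbar, n) for t in grid
)
print(score(theta_hat, xbar, n), score_derivative(theta_hat, xbar, n), is_global_max)
```

The score vanishes at $\bar{x}$, its derivative there equals $-n/\bar{x}^2$, and no grid point has a larger log-likelihood.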


In both examples just considered, we were able to derive simple formulas for the 
MLE. This is not always possible. Consider the following example. 


EXAMPLE 6.2.4 

Consider a population in which individuals are classified according to one of three types
labelled 1, 2, and 3, respectively. Further suppose that the proportions of individuals
falling in these categories are known to follow the law $p_1 = \theta$, $p_2 = \theta^2$,
$p_3 = 1 - \theta - \theta^2$, where

$$\theta \in [0, (\sqrt{5} - 1)/2] = [0, 0.61803]$$

is unknown. Here, $p_i$ denotes the proportion of individuals in the $i$th class. Note that
the requirement that $0 \leq \theta + \theta^2 \leq 1$ imposes the upper bound on $\theta$, and the precise
bound is obtained by solving $\theta + \theta^2 - 1 = 0$ for $\theta$ using the formula for the roots
of a quadratic. Relationships like this, amongst the proportions of the distribution of
a categorical variable, often arise in genetics. For example, the categorical variable
might serve to classify individuals into different genotypes.

For a sample of $n$ (where $n$ is small relative to the size of the population so that we
can assume observations are i.i.d.), the likelihood function is given by

$$L(\theta \mid s_1, \ldots, s_n) = \theta^{x_1 + 2x_2}(1 - \theta - \theta^2)^{x_3},$$

where $x_i$ denotes the sample count in the $i$th class. The log-likelihood function is then

$$l(\theta \mid s_1, \ldots, s_n) = (x_1 + 2x_2)\ln\theta + x_3\ln(1 - \theta - \theta^2),$$

and the score function is

$$S(\theta \mid s_1, \ldots, s_n) = \frac{x_1 + 2x_2}{\theta} - \frac{x_3(1 + 2\theta)}{1 - \theta - \theta^2}.$$

The score equation then leads to a solution $\hat{\theta}$ being a root of the quadratic

$$\begin{aligned}
&(x_1 + 2x_2)(1 - \theta - \theta^2) - x_3(\theta + 2\theta^2) \\
&\quad = -(x_1 + 2x_2 + 2x_3)\theta^2 - (x_1 + 2x_2 + x_3)\theta + (x_1 + 2x_2).
\end{aligned}$$

Using the formula for the roots of a quadratic, we obtain

$$\hat{\theta} = \frac{1}{2(x_1 + 2x_2 + 2x_3)}\left(-x_1 - 2x_2 - x_3 \pm \sqrt{5x_1^2 + 20x_1x_2 + 10x_1x_3 + 20x_2^2 + 20x_2x_3 + x_3^2}\right).$$


Notice that the formula for the roots does not determine the MLE in a clear way. In 
fact, we cannot even tell if either of the roots lies in [0, 1]! So there are four possible 




values for the MLE at this point: either of the roots or the boundary points 0 and
0.61803.

We can resolve this easily in an application by simply numerically evaluating the
likelihood at the four points. For example, if $x_1 = 70$, $x_2 = 5$, and $x_3 = 25$, then the
roots are $-1.28616$ and $0.47847$, so it is immediate that the MLE is $\hat{\theta}(s_1, \ldots, s_n) =
0.47847$. We can see this graphically in the plot of the log-likelihood provided in
Figure 6.2.1. ■


[Plot of the log-likelihood ln L (vertical axis, from -120 to -90) against theta (horizontal axis).]

Figure 6.2.1: The log-likelihood function in Example 6.2.4 when $x_1 = 70$, $x_2 = 5$, and
$x_3 = 25$.
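The procedure of Example 6.2.4, namely computing both roots of the quadratic and then comparing the log-likelihood at the admissible candidates, can be sketched as follows. The function names are illustrative, and the sketch assumes $x_1 + 2x_2 > 0$ and $x_3 > 0$, so that the log-likelihood tends to $-\infty$ at both boundary points.

```python
import math

UPPER = (math.sqrt(5.0) - 1.0) / 2.0   # upper bound of the parameter space

def log_likelihood(theta, x1, x2, x3):
    """l(theta) = (x1 + 2 x2) ln(theta) + x3 ln(1 - theta - theta^2)."""
    rest = 1.0 - theta - theta ** 2
    if theta <= 0.0 or rest <= 0.0:
        return -math.inf               # valid under the positivity assumption above
    return (x1 + 2 * x2) * math.log(theta) + x3 * math.log(rest)

def score_roots(x1, x2, x3):
    """Both roots of (x1+2x2+2x3) t^2 + (x1+2x2+x3) t - (x1+2x2) = 0."""
    a = x1 + 2 * x2 + 2 * x3
    b = x1 + 2 * x2 + x3
    c = -(x1 + 2 * x2)
    root_disc = math.sqrt(b * b - 4 * a * c)
    return (-b - root_disc) / (2 * a), (-b + root_disc) / (2 * a)

def mle(x1, x2, x3):
    """Compare the boundary points and any root lying inside the parameter space."""
    candidates = [0.0, UPPER]
    candidates += [r for r in score_roots(x1, x2, x3) if 0.0 <= r <= UPPER]
    return max(candidates, key=lambda t: log_likelihood(t, x1, x2, x3))

roots = score_roots(70, 5, 25)
theta_hat = mle(70, 5, 25)
print(roots, theta_hat)
```

Only the positive root lies in the parameter space, and it has a larger log-likelihood than either boundary point, in agreement with the value reported in the example.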


In general, the score equation (6.2.2) must be solved numerically, using an iterative
routine like Newton-Raphson. Example 6.2.4 demonstrates that we must be very careful
not to just accept a solution from such a procedure as the MLE, but to check that the
fundamental defining property (6.2.1) is satisfied. We also have to be careful that the
necessary smoothness conditions are satisfied so that calculus can be used. Consider
the following example.


EXAMPLE 6.2.5 Uniform[0, 0] Model 
Suppose $(x_1, \ldots, x_n)$ is a sample from the Uniform$[0, \theta]$ model where $\theta > 0$ is unknown.
Then the likelihood function is given by

$$L(\theta \mid x_1, \ldots, x_n) = \begin{cases} \theta^{-n} & x_i \leq \theta \text{ for } i = 1, \ldots, n \\ 0 & x_i > \theta \text{ for some } i \end{cases} \;=\; \theta^{-n} I_{[x_{(n)}, \infty)}(\theta),$$

where $x_{(n)}$ is the largest order statistic from the sample. In Figure 6.2.2, we have
graphed this function when $n = 10$ and $x_{(n)} = 1.916$. Notice that the maximum clearly
occurs at $x_{(n)}$; we cannot obtain this value via differentiation, as $L(\cdot \mid x_1, \ldots, x_n)$ is not
differentiable there. ■




[Plot of the likelihood (vertical axis, from 0.0000 to 0.0015) against theta (horizontal axis).]

Figure 6.2.2: Plot of the likelihood function in Example 6.2.5 when $n = 10$ and
$x_{(10)} = 1.916$.
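Because differentiation is unavailable here, the maximum can instead be confirmed by direct evaluation. The sketch below uses a made-up sample of size 10 whose largest value is 1.916 (the individual observations are assumptions, since the text reports only $n$ and $x_{(n)}$) and compares the likelihood at $x_{(n)}$ with its values on a grid.

```python
def uniform_likelihood(theta, sample):
    """theta^(-n) when theta >= max(sample), and 0 otherwise (theta > 0 assumed)."""
    if theta < max(sample):
        return 0.0
    return theta ** (-len(sample))

# Hypothetical data with n = 10 and largest order statistic 1.916.
sample = [0.312, 1.204, 0.877, 1.916, 0.095, 1.533, 0.641, 1.102, 0.428, 1.760]
x_max = max(sample)

# Compare the likelihood at x_(n) with its values on a grid over (0, 5].
grid = [0.01 * k for k in range(1, 501)]
best = max(grid + [x_max], key=lambda t: uniform_likelihood(t, sample))
print(best, uniform_likelihood(best, sample))
```

The likelihood is 0 just below $x_{(n)}$ and strictly decreasing above it, which is why the maximizer sits exactly at the jump.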


The lesson of Examples 6.2.4 and 6.2.5 is that we have to be careful when com- 
puting MLEs. We now look at an example of a two-dimensional problem in which the 
MLE can be obtained using one-dimensional methods. 


EXAMPLE 6.2.6 Location-Scale Normal Model 

Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where $\mu \in R^1$
and $\sigma > 0$ are unknown. The parameter in this model is two-dimensional, given by
$\theta = (\mu, \sigma^2) \in \Omega = R^1 \times (0, \infty)$. The likelihood function is then given by

$$L(\mu, \sigma^2 \mid x_1, \ldots, x_n) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2\right)\exp\left(-\frac{n-1}{2\sigma^2}s^2\right),$$

as shown in Example 6.1.8. The log-likelihood function is given by

$$l(\mu, \sigma^2 \mid x_1, \ldots, x_n) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{n}{2\sigma^2}(\bar{x} - \mu)^2 - \frac{n-1}{2\sigma^2}s^2. \tag{6.2.4}$$

As discussed in Example 6.1.8, it is clear that, for fixed $\sigma^2$, (6.2.4) is maximized, as a
function of $\mu$, by $\hat{\mu} = \bar{x}$. Note that this does not involve $\sigma^2$, so this must be the first
coordinate of the MLE.

Substituting $\mu = \bar{x}$ into (6.2.4), we obtain

$$-\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{n-1}{2\sigma^2}s^2, \tag{6.2.5}$$

and the second coordinate of the MLE must be the value of $\sigma^2$ that maximizes (6.2.5).
Differentiating (6.2.5) with respect to $\sigma^2$ and setting this equal to 0 gives

$$-\frac{n}{2\sigma^2} + \frac{n-1}{2\sigma^4}s^2 = 0. \tag{6.2.6}$$



Solving (6.2.6) for $\sigma^2$ leads to the solution

$$\hat{\sigma}^2 = \frac{n-1}{n}s^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.$$

Differentiating (6.2.6) with respect to $\sigma^2$, and substituting in $\hat{\sigma}^2$, we see that the second
derivative is negative, hence $\hat{\sigma}^2$ is a point where the maximum is attained.
Therefore, we have shown that the MLE of $(\mu, \sigma^2)$ is given by

$$\left(\bar{x}, \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2\right).$$

In the following section we will show that this result can also be obtained using
multidimensional calculus. ■
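The closed-form MLE derived in Example 6.2.6 is easy to compute. The following is a minimal sketch on a small made-up sample (the data values are assumptions for illustration); it also compares the log-likelihood at the MLE with its value at a few nearby parameter values.

```python
import math

def normal_mle(sample):
    """MLE (xbar, (n-1)/n * s^2) for the location-scale normal model."""
    n = len(sample)
    xbar = sum(sample) / n
    s_sq = sum((x - xbar) ** 2 for x in sample) / (n - 1)   # sample variance s^2
    return xbar, (n - 1) / n * s_sq

def log_likelihood(mu, sigma_sq, sample):
    """Log-likelihood of an i.i.d. N(mu, sigma_sq) sample."""
    n = len(sample)
    squares = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * math.log(2.0 * math.pi * sigma_sq) - squares / (2.0 * sigma_sq)

sample = [4.1, 5.3, 3.8, 6.0, 4.8]          # hypothetical data
mu_hat, sigma_sq_hat = normal_mle(sample)
peak = log_likelihood(mu_hat, sigma_sq_hat, sample)

# The log-likelihood should be lower at each nearby parameter value.
nearby = [(mu_hat + d, sigma_sq_hat) for d in (-0.1, 0.1)]
nearby += [(mu_hat, sigma_sq_hat + d) for d in (-0.1, 0.1)]
lower_nearby = all(log_likelihood(m, v, sample) < peak for m, v in nearby)
print(mu_hat, sigma_sq_hat, lower_nearby)
```

Note that the second coordinate divides the sum of squares by $n$ rather than $n - 1$, so it is slightly smaller than $s^2$.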


So far we have talked about estimating only the full parameter $\theta$ for a model. What
about estimating a general characteristic of interest $\psi(\theta)$ for some function $\psi$ defined
on the parameter space $\Omega$? Perhaps the obvious answer here is to use the estimate
$\hat{\psi}(s) = \psi(\hat{\theta}(s))$ where $\hat{\theta}(s)$ is an MLE of $\theta$. This is sometimes referred to as the
plug-in MLE of $\psi$. Notice, however, that the plug-in MLE is not necessarily a true MLE, in
the sense that we have a likelihood function for a model indexed by $\psi$ and that takes
its maximum value at $\hat{\psi}(s)$. If $\psi$ is a 1-1 function defined on $\Omega$, then Theorem 6.2.1
establishes that $\hat{\psi}(s)$ is a true MLE but not otherwise.

If $\psi$ is not 1-1, then we can often find a complementing function $\lambda$ defined on $\Omega$ so
that $(\psi, \lambda)$ is a 1-1 function of $\theta$. Then, by Theorem 6.2.1,

$$(\hat{\psi}(s), \hat{\lambda}(s)) = (\psi(\hat{\theta}(s)), \lambda(\hat{\theta}(s)))$$

is the joint MLE, but $\hat{\psi}(s)$ is still not formally an MLE. Sometimes a plug-in MLE can
perform badly, as it ignores the information in $\lambda(\hat{\theta}(s))$ about the true value of $\psi$. An
example illustrates this phenomenon.


EXAMPLE 6.2.7 Sum of Squared Means 
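Suppose that $X_i \sim N(\mu_i, 1)$ for $i = 1, \ldots, n$ and that these are independent with the
$\mu_i$ completely unknown. So here, $\theta = (\mu_1, \ldots, \mu_n)$ and $\Omega = R^n$. Suppose we want
to estimate $\psi(\theta) = \mu_1^2 + \cdots + \mu_n^2$.
The log-likelihood function is given by

$$l(\theta \mid x_1, \ldots, x_n) = -\frac{1}{2}\sum_{i=1}^n (x_i - \mu_i)^2.$$

Clearly this is maximized by $\hat{\theta}(x_1, \ldots, x_n) = (x_1, \ldots, x_n)$. So the plug-in MLE of $\psi$
is given by $\hat{\psi} = \sum_{i=1}^n x_i^2$.
Now observe that

$$E_\theta\left(\sum_{i=1}^n X_i^2\right) = \sum_{i=1}^n E_\theta(X_i^2) = \sum_{i=1}^n \left(\mathrm{Var}_\theta(X_i) + \mu_i^2\right) = n + \psi(\theta),$$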




where $E_\theta(g)$ refers to the expectation of $g(s)$ when $s \sim f_\theta$. So when $n$ is large, it is
likely that $\hat{\psi}$ is far from the true value. An immediate improvement in this estimator is
to use $\sum_{i=1}^{n} x_i^2 - n$ instead. ∎
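
The bias of the plug-in MLE in this example can be checked by simulation. The following is a minimal sketch, not part of the original text: the true means (all equal to 0.5), the number of replications, and the seed are arbitrary illustrative choices, and the function names are hypothetical.

```python
import random

def plug_in_and_corrected(mu, rng):
    """Draw X_i ~ N(mu_i, 1) and return (sum x_i^2, sum x_i^2 - n)."""
    xs = [rng.gauss(m, 1.0) for m in mu]
    plug_in = sum(x * x for x in xs)
    return plug_in, plug_in - len(mu)

def average_estimates(mu, reps=2000, seed=0):
    """Average both estimators over `reps` simulated data sets."""
    rng = random.Random(seed)
    total_plug = total_corr = 0.0
    for _ in range(reps):
        p, c = plug_in_and_corrected(mu, rng)
        total_plug += p
        total_corr += c
    return total_plug / reps, total_corr / reps

# Assumed (made-up) true means: n = 100 and psi(theta) = 100 * 0.25 = 25.
mu = [0.5] * 100
psi = sum(m * m for m in mu)
avg_plug, avg_corr = average_estimates(mu)
print("true psi:", psi)
print("average plug-in MLE:", avg_plug)       # should be near n + psi
print("average corrected estimate:", avg_corr)  # should be near psi
```

Under these assumptions the average of the plug-in MLE should sit near $n + \psi(\theta)$, while the corrected estimate should average close to $\psi(\theta)$.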


There have been various attempts to correct problems such as the one illustrated in 
Example 6.2.7. Typically, these involve modifying the likelihood in some way. We do 
not pursue this issue further in this text but we do advise caution when using plug-in 
MLEs. Sometimes, as in Example 6.2.6, where we estimate $\mu$ by $\bar{x}$ and $\sigma^2$ by $s^2$, they
seem appropriate; other times, as in Example 6.2.7, they do not.


6.2.2 | The Multidimensional Case (Advanced) 


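We now consider the situation in which $\theta = (\theta_1, \ldots, \theta_k) \in R^k$ is multidimensional,
i.e., $k > 1$. The likelihood and log-likelihood are then defined just as before, but the
score function is now given by

$$S(\theta \mid s) = \begin{pmatrix} \dfrac{\partial l(\theta \mid s)}{\partial \theta_1} \\ \dfrac{\partial l(\theta \mid s)}{\partial \theta_2} \\ \vdots \\ \dfrac{\partial l(\theta \mid s)}{\partial \theta_k} \end{pmatrix},$$

provided all these partial derivatives exist. For the score equation, we get

$$\begin{pmatrix} \dfrac{\partial l(\theta \mid s)}{\partial \theta_1} \\ \dfrac{\partial l(\theta \mid s)}{\partial \theta_2} \\ \vdots \\ \dfrac{\partial l(\theta \mid s)}{\partial \theta_k} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix},$$

and we must solve this $k$-dimensional equation for $(\theta_1, \ldots, \theta_k)$. This is often much
more difficult than in the one-dimensional case, and we typically have to resort to
numerical methods.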

A sufficient condition for a solution $(\hat{\theta}_1, \ldots, \hat{\theta}_k)$ of the score equation to be a local
maximum, when the log-likelihood has continuous second partial derivatives, is that the
matrix of second partial derivatives of the log-likelihood, evaluated at $(\hat{\theta}_1, \ldots, \hat{\theta}_k)$, be
negative definite (equivalently, all of its eigenvalues must be negative). We then must evaluate
the likelihood at each of the local maxima obtained to determine the global maximum
or MLE.

We will not pursue the numerical computation of MLEs in the multidimensional 
case any further here, but we restrict our attention to a situation in which we carry out 
the calculations in closed form. 


EXAMPLE 6.2.8 Location-Scale Normal Model 
We determined the log-likelihood function for this model in (6.2.4). The score function 
is then 




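$$S(\theta \mid x_1, \ldots, x_n) = \begin{pmatrix} \dfrac{n}{\sigma^2}(\bar{x} - \mu) \\ -\dfrac{n}{2\sigma^2} + \dfrac{n}{2\sigma^4}(\bar{x} - \mu)^2 + \dfrac{n-1}{2\sigma^4}\,s^2 \end{pmatrix}.$$

The score equation is

$$\begin{pmatrix} \dfrac{n}{\sigma^2}(\bar{x} - \mu) \\ -\dfrac{n}{2\sigma^2} + \dfrac{n}{2\sigma^4}(\bar{x} - \mu)^2 + \dfrac{n-1}{2\sigma^4}\,s^2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$

and the first of these equations immediately implies that $\hat{\mu} = \bar{x}$. Substituting this value
for $\mu$ into the second equation and solving for $\sigma^2$ leads to the solution

$$\hat{\sigma}^2 = \frac{n-1}{n}\,s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$

From Example 6.2.6, we know that this solution does indeed give the MLE. ∎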
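
As a numerical check of this example, the sketch below evaluates the score vector at the closed-form solution and compares the log-likelihood there with nearby parameter values. It assumes the log-likelihood has the form $-\frac{n}{2}\ln\sigma^2 - \frac{n}{2\sigma^2}(\bar{x}-\mu)^2 - \frac{n-1}{2\sigma^2}s^2$ (constants dropped); the data values and function names are made up for illustration.

```python
import math

def summaries(xs):
    """Return n, the sample mean, and the sample variance s^2 (divisor n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return n, xbar, s2

def log_likelihood(mu, sigma2, xs):
    """Location-scale normal log-likelihood, additive constants dropped."""
    n, xbar, s2 = summaries(xs)
    return (-(n / 2) * math.log(sigma2)
            - n * (xbar - mu) ** 2 / (2 * sigma2)
            - (n - 1) * s2 / (2 * sigma2))

def score(mu, sigma2, xs):
    """Partial derivatives of the log-likelihood in mu and in sigma^2."""
    n, xbar, s2 = summaries(xs)
    d_mu = n * (xbar - mu) / sigma2
    d_sigma2 = (-n / (2 * sigma2)
                + n * (xbar - mu) ** 2 / (2 * sigma2 ** 2)
                + (n - 1) * s2 / (2 * sigma2 ** 2))
    return d_mu, d_sigma2

def mle(xs):
    """Closed-form MLE: (xbar, (n - 1) s^2 / n)."""
    n, xbar, s2 = summaries(xs)
    return xbar, (n - 1) * s2 / n

xs = [4.1, 5.3, 3.8, 6.0, 5.1]  # hypothetical data
mu_hat, sigma2_hat = mle(xs)
print("MLE:", mu_hat, sigma2_hat)
print("score at MLE:", score(mu_hat, sigma2_hat, xs))  # both entries ~ 0
```

Both components of the score should vanish (up to rounding) at the closed-form solution, and perturbing either coordinate should not increase the log-likelihood.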


Summary of Section 6.2 


• An MLE (maximum likelihood estimator) is a value of the parameter $\theta$ that maximizes
the likelihood function. It is the value of $\theta$ that is best supported by the
model and data.

• We can often compute an MLE by using the methods of calculus. When applicable,
this leads to solving the score equation for $\theta$ either explicitly or using
numerical algorithms. Always be careful to check that these methods are applicable
to the specific problem at hand. Furthermore, always check that any
solution to the score equation is a maximum and indeed an absolute maximum.


EXERCISES 


6.2.1 Suppose that $S = \{1, 2, 3, 4\}$, $\Omega = \{a, b\}$, where the two probability distributions
are given by the following table.

            s = 1    s = 2    s = 3    s = 4
θ = a        1/2      1/6      1/6      1/6
θ = b        1/3      1/3      1/3       0

Determine the MLE of $\theta$ for each possible data value.

6.2.2 If $(x_1, \ldots, x_n)$ is a sample from a Bernoulli($\theta$) distribution, where $\theta \in [0, 1]$ is
unknown, then determine the MLE of $\theta$.

6.2.3 If $(x_1, \ldots, x_n)$ is a sample from a Bernoulli($\theta$) distribution, where $\theta \in [0, 1]$ is
unknown, then determine the MLE of $\theta^2$.




6.2.4 If $(x_1, \ldots, x_n)$ is a sample from a Poisson($\theta$) distribution, where $\theta \in (0, \infty)$ is
unknown, then determine the MLE of $\theta$.

6.2.5 If $(x_1, \ldots, x_n)$ is a sample from a Gamma($\alpha_0, \theta$) distribution, where $\alpha_0 > 0$ and
$\theta \in (0, \infty)$ is unknown, then determine the MLE of $\theta$.

6.2.6 Suppose that $(x_1, \ldots, x_n)$ is the result of independent tosses of a coin where we
toss until the first head occurs and where the probability of a head on a single toss is
$\theta \in (0, 1]$. Determine the MLE of $\theta$.

6.2.7 If $(x_1, \ldots, x_n)$ is a sample from a Beta($\alpha$, 1) distribution (see Problem 2.4.24)
where $\alpha > 0$ is unknown, then determine the MLE of $\alpha$. (Hint: Assume $\Gamma(\alpha)$ is a
differentiable function of $\alpha$.)

6.2.8 If $(x_1, \ldots, x_n)$ is a sample from a Weibull($\beta$) distribution (see Problem 2.4.19),
where $\beta > 0$ is unknown, then determine the score equation for the MLE of $\beta$.

6.2.9 If $(x_1, \ldots, x_n)$ is a sample from a Pareto($\alpha$) distribution (see Problem 2.4.20),
where $\alpha > 0$ is unknown, then determine the MLE of $\alpha$.

6.2.10 If $(x_1, \ldots, x_n)$ is a sample from a Log-normal($\tau$) distribution (see Problem
2.6.17), where $\tau > 0$ is unknown, then determine the MLE of $\tau$.

6.2.11 Suppose you are measuring the volume of a cubic box in centimeters by taking
repeated independent measurements of one of the sides. Suppose it is reasonable to assume
that a single measurement follows an $N(\mu, \sigma_0^2)$ distribution, where $\mu$ is unknown
and $\sigma_0^2$ is known. Based on a sample of measurements, you obtain the MLE of $\mu$ as 3.2
cm. What is your estimate of the volume of the box? How do you justify this in terms
of the likelihood function?

6.2.12 If $(x_1, \ldots, x_n)$ is a sample from an $N(\mu_0, \sigma^2)$ distribution, where $\sigma^2 > 0$ is
unknown and $\mu_0$ is known, then determine the MLE of $\sigma^2$. How does this MLE differ
from the plug-in MLE of $\sigma^2$ computed using the location-scale normal model?

6.2.13 Explain why it is not possible that the function $\theta^3 \exp(-(\theta - 5.3)^2)$ for $\theta \in R^1$
is a likelihood function.

6.2.14 Suppose you are told that a likelihood function has local maxima at the points
−2.2, 4.6, and 9.2, as determined using calculus. Explain how you would determine
the MLE.

6.2.15 If two functions of $\theta$ are equivalent versions of the likelihood when one is a
positive multiple of the other, then when are two log-likelihood functions equivalent?

6.2.16 Suppose you are told that the likelihood of $\theta$ at $\theta = 2$ is given by 1/4. Is this
the probability that $\theta = 2$? Explain why or why not.


COMPUTER EXERCISES 


6.2.17 A likelihood function is given by $\exp(-(\theta - 1)^2/2) + 3\exp(-(\theta - 2)^2/2)$
for $\theta \in R^1$. Numerically approximate the MLE by evaluating this function at 1000
equispaced points in (−10, 10]. Also plot the likelihood function.

6.2.18 A likelihood function is given by $\exp(-(\theta - 1)^2/2) + 3\exp(-(\theta - 5)^2/2)$
for $\theta \in R^1$. Numerically approximate the MLE by evaluating this function at 1000
equispaced points in (−10, 10]. Also plot the likelihood function. Comment on the
form of likelihood intervals.




PROBLEMS 


6.2.19 (Hardy-Weinberg law) The Hardy-Weinberg law in genetics says that the proportions
of genotypes AA, Aa, and aa are $\theta^2$, $2\theta(1 - \theta)$, and $(1 - \theta)^2$, respectively,
where $\theta \in [0, 1]$. Suppose that in a sample of $n$ from the population (small relative
to the size of the population), we observe $x_1$ individuals of type AA, $x_2$ individuals of
type Aa, and $x_3$ individuals of type aa.
(a) What distribution do the counts $(X_1, X_2, X_3)$ follow?
(b) Record the likelihood function, the log-likelihood function, and the score function
for $\theta$.
(c) Record the form of the MLE for $\theta$.

6.2.20 If $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, 1)$ distribution where $\mu \in R^1$ is unknown,
determine the MLE of the probability content of the interval $(-\infty, 1)$. Justify
your answer.

6.2.21 If $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, 1)$ distribution where $\mu \ge 0$ is unknown,
determine the MLE of $\mu$.

6.2.22 Prove that, if $\hat{\theta}(s)$ is the MLE for a model for response $s$ and if $T$ is a sufficient
statistic for the model, then $\hat{\theta}(s)$ is also the MLE for the model for $T(s)$.

6.2.23 Suppose that $(X_1, X_2, X_3) \sim$ Multinomial($n, \theta_1, \theta_2, \theta_3$) (see Example 6.1.5),
where

$$\Omega = \{(\theta_1, \theta_2, \theta_3) : 0 \le \theta_i \le 1,\ \theta_1 + \theta_2 + \theta_3 = 1\}$$

and we observe $(X_1, X_2, X_3) = (x_1, x_2, x_3)$.
(a) Determine the MLE of $(\theta_1, \theta_2, \theta_3)$.
(b) What is the plug-in MLE of 01 + 02 = 03? 
6.2.24 If $(x_1, \ldots, x_n)$ is a sample from a Uniform[$\theta_1, \theta_2$] distribution with

$$\Omega = \{(\theta_1, \theta_2) \in R^2 : \theta_1 < \theta_2\},$$

determine the MLE of $(\theta_1, \theta_2)$. (Hint: You cannot use calculus. Instead, directly
determine the maximum over $\theta_1$ when $\theta_2$ is fixed, and then vary $\theta_2$.)


COMPUTER PROBLEMS 


6.2.25 Suppose the proportion of left-handed individuals in a population is $\theta$. Based
on a simple random sample of 20, you observe four left-handed individuals.

(a) Assuming the sample size is small relative to the population size, plot the
log-likelihood function and determine the MLE.

(b) If instead the population size is only 50, then plot the log-likelihood function and
determine the MLE. (Hint: Remember that the number of left-handed individuals follows
a hypergeometric distribution. This forces $\theta$ to be of the form $i/50$ for some
integer $i$ between 4 and 34. From a tabulation of the log-likelihood, you can obtain the
MLE.)




CHALLENGES 


6.2.26 If $(x_1, \ldots, x_n)$ is a sample from a distribution with density

$$f_\theta(x) = (1/2)\exp(-|x - \theta|)$$

for $x \in R^1$ and where $\theta \in R^1$ is unknown, then determine the MLE of $\theta$. (Hint:
You cannot use calculus. Instead, maximize the log-likelihood in each of the intervals
$(-\infty, x_{(1)})$, $x_{(1)} \le \theta < x_{(2)}$, etc.)


DISCUSSION TOPICS 


6.2.27 One approach to quantifying the uncertainty in an MLE $\hat{\theta}(s)$ is to report the
MLE together with a likelihood interval $\{\theta : L(\theta \mid s) \ge c\,L(\hat{\theta}(s) \mid s)\}$ for some constant
$c \in (0, 1)$. What problems do you see with this approach? In particular, how would
you choose $c$?


6.3 | Inferences Based on the MLE 


In Table 6.3.1, we have recorded $n = 66$ measurements of the speed of light (passage
time recorded as deviations from 24,800 nanoseconds between two mirrors 7400
meters apart) made by A. A. Michelson and S. Newcomb in 1882.


Table 6.3.1: Speed of light measurements.


Figure 6.3.1 is a boxplot of these data with the variable labeled as $x$. Notice there
are two outliers at $x = -2$ and $x = -44$. We will presume there is something very
special about these observations and discard them for the remainder of our discussion.


Figure 6.3.1: Boxplot of the data values in Table 6.3.1. 




Figure 6.3.2 presents a histogram of these data minus the two data values identified
as outliers. Notice that the histogram looks reasonably symmetrical, so it seems plausible
to assume that these data are from an $N(\mu, \sigma^2)$ distribution for some values of $\mu$
and $\sigma^2$. Accordingly, a reasonable statistical model for these data would appear to be
the location-scale normal model. In Chapter 9, we will discuss further how to assess
the validity of the normality assumption.


Figure 6.3.2: Density histogram of the data in Table 6.3.1 with the outliers removed.


If we accept that the location-scale normal model makes sense, the question arises 
concerning how to make inferences about the unknown parameters $\mu$ and $\sigma^2$. The
purpose of this section is to develop methods for handling problems like this. The 
methods developed in this section depend on special features of the MLE in a given 
context. In Section 6.5, we develop a more general approach based on the MLE. 


6.3.1 | Standard Errors, Bias, and Consistency 


Based on the justification for the likelihood, the MLE $\hat{\theta}(s)$ seems like a natural estimate
of the true value of $\theta$. Let us suppose that we will then use the plug-in MLE estimate
$\hat{\psi}(s) = \psi(\hat{\theta}(s))$ for a characteristic of interest $\psi(\theta)$ (e.g., $\psi(\theta)$ might be the first
quartile or the variance).

In an application, we want to know how reliable the estimate $\hat{\psi}(s)$ is. In other
words, can we expect $\hat{\psi}(s)$ to be close to the true value of $\psi(\theta)$, or is there a reasonable
chance that $\hat{\psi}(s)$ is far from the true value? This leads us to consider the sampling
distribution of $\hat{\psi}(s)$, as this tells us how much variability there will be in $\hat{\psi}(s)$ under
repeated sampling from the true distribution $f_\theta$. Because we do not know what the true
value of $\theta$ is, we have to look at the sampling distribution of $\hat{\psi}(s)$ for every $\theta \in \Omega$.

To simplify this, we substitute a numerical measure of how concentrated these sampling
distributions are about $\psi(\theta)$. Perhaps the most commonly used measure of the
accuracy of a general estimator $T(s)$ of $\psi(\theta)$, i.e., we are not restricting ourselves to
plug-in MLEs, is the mean-squared error.




Definition 6.3.1 The mean-squared error (MSE) of the estimator $T$ of $\psi(\theta) \in R^1$
is given by $\mathrm{MSE}_\theta(T) = E_\theta((T - \psi(\theta))^2)$ for each $\theta \in \Omega$.


Clearly, the smaller $\mathrm{MSE}_\theta(T)$ is, the more concentrated the sampling distribution of
$T(s)$ is about the value $\psi(\theta)$.

Looking at $\mathrm{MSE}_\theta(T)$ as a function of $\theta$ gives us some idea of how reliable $T(s)$
is as an estimate of the true value of $\psi(\theta)$. Because we do not know the true value of
$\theta$, and thus the true value of $\mathrm{MSE}_\theta(T)$, statisticians record an estimate of the
mean-squared error at the true value. Often

$$\mathrm{MSE}_{\hat{\theta}(s)}(T)$$

is used for this. In other words, we evaluate $\mathrm{MSE}_\theta(T)$ at $\theta = \hat{\theta}(s)$ as a measure of the
accuracy of the estimate $T(s)$.

The following result gives an important identity for the MSE. 


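Theorem 6.3.1 If $\psi(\theta) \in R^1$ and $T$ is a real-valued function defined on $S$ such
that $E_\theta(T)$ exists, then

$$\mathrm{MSE}_\theta(T) = \mathrm{Var}_\theta(T) + (E_\theta(T) - \psi(\theta))^2. \qquad (6.3.1)$$


PROOF We have

$$\begin{aligned}
E_\theta((T - \psi(\theta))^2) &= E_\theta((T - E_\theta(T) + E_\theta(T) - \psi(\theta))^2) \\
&= E_\theta((T - E_\theta(T))^2) + 2E_\theta((T - E_\theta(T))(E_\theta(T) - \psi(\theta))) + (E_\theta(T) - \psi(\theta))^2 \\
&= \mathrm{Var}_\theta(T) + (E_\theta(T) - \psi(\theta))^2
\end{aligned}$$

because

$$E_\theta((T - E_\theta(T))(E_\theta(T) - \psi(\theta))) = (E_\theta(T - E_\theta(T)))(E_\theta(T) - \psi(\theta)) = 0. \ ∎$$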


The second term in (6.3.1) is the square of the bias in the estimator T. 
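
The identity (6.3.1) can be verified exactly on a small model by enumerating every possible sample. The sketch below is an illustration only: it assumes a sample of $n = 4$ from a Bernoulli($\theta$) distribution with $\theta = 1/4$ and uses a deliberately biased, hypothetical estimator $(\sum x_i + 1)/(n + 2)$ of $\theta$, so that the bias term is nonzero.

```python
from fractions import Fraction
from itertools import product

def exact_moments(estimator, theta, n):
    """Exact mean, variance, and MSE of an estimator of theta under
    n i.i.d. Bernoulli(theta) observations, by enumerating all samples."""
    mean = second = mse = Fraction(0)
    for xs in product((0, 1), repeat=n):
        k = sum(xs)
        prob = theta ** k * (1 - theta) ** (n - k)
        t = estimator(xs)
        mean += prob * t
        second += prob * t * t
        mse += prob * (t - theta) ** 2
    return mean, second - mean ** 2, mse

def shrunk_mean(xs):
    # Hypothetical biased estimator of theta, chosen only for illustration.
    return Fraction(sum(xs) + 1, len(xs) + 2)

theta = Fraction(1, 4)
mean, var, mse = exact_moments(shrunk_mean, theta, n=4)
bias = mean - theta
print("bias:", bias)
print("MSE:", mse)
print("Var + bias^2:", var + bias ** 2)
```

Because the arithmetic uses exact fractions, the two printed quantities agree exactly rather than only up to rounding.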


Definition 6.3.2 The bias in the estimator $T$ of $\psi(\theta)$ is given by $E_\theta(T) - \psi(\theta)$
whenever $E_\theta(T)$ exists. When the bias in an estimator $T$ is 0 for every $\theta$, we call $T$
an unbiased estimator of $\psi$, i.e., $T$ is unbiased whenever $E_\theta(T) = \psi(\theta)$ for every
$\theta \in \Omega$.


Note that when the bias in an estimator is 0, then the MSE is just the variance.
Unbiasedness tells us that, in a sense, the sampling distribution of the estimator is
centered on the true value. For unbiased estimators,

$$\mathrm{MSE}_{\hat{\theta}(s)}(T) = \mathrm{Var}_{\hat{\theta}(s)}(T)$$

and

$$\mathrm{Sd}_{\hat{\theta}(s)}(T) = \sqrt{\mathrm{Var}_{\hat{\theta}(s)}(T)}$$

is an estimate of the standard deviation of $T$ and is referred to as the standard error
of the estimate $T(s)$. As a principle of good statistical practice, whenever we quote
an estimate of a quantity, we should also provide its standard error, at least when
we have an unbiased estimator, as this tells us something about the accuracy of the
estimate.

We consider some examples.


EXAMPLE 6.3.1 Location Normal Model
Consider the likelihood function

$$L(\mu \mid x_1, \ldots, x_n) = \exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu)^2\right),$$

obtained in Example 6.1.4 for a sample $(x_1, \ldots, x_n)$ from the $N(\mu, \sigma_0^2)$ model, where
$\mu \in R^1$ is unknown and $\sigma_0^2 > 0$ is known. Suppose we want to estimate $\mu$. The MLE
of $\mu$ was computed in Example 6.2.2 to be $\bar{x}$.

In this case, we can determine the sampling distribution of the MLE exactly from
the results in Section 4.6. We have that $\bar{X} \sim N(\mu, \sigma_0^2/n)$ and so $\bar{X}$ is unbiased, and

$$\mathrm{MSE}_\mu(\bar{X}) = \mathrm{Var}_\mu(\bar{X}) = \frac{\sigma_0^2}{n},$$

which is independent of $\mu$. So we do not need to estimate the MSE in this case. The
standard error of the estimate is given by

$$\mathrm{Sd}_{\hat{\mu}}(\bar{X}) = \frac{\sigma_0}{\sqrt{n}}.$$

Note that the standard error decreases as the population variance $\sigma_0^2$ decreases and as
the sample size $n$ increases. ∎


EXAMPLE 6.3.2 Bernoulli Model 
Suppose $(x_1, \ldots, x_n)$ is a sample from a Bernoulli($\theta$) distribution where $\theta \in [0, 1]$ is
unknown. Suppose we wish to estimate $\theta$. The likelihood function is given by

$$L(\theta \mid x_1, \ldots, x_n) = \theta^{n\bar{x}}(1 - \theta)^{n(1 - \bar{x})},$$

and the MLE of $\theta$ is $\bar{x}$ (Exercise 6.2.2), the proportion of successes in the $n$ performances.
We have $E_\theta(\bar{X}) = \theta$ for every $\theta \in [0, 1]$, so the MLE is an unbiased estimator
of $\theta$.

Therefore,

$$\mathrm{MSE}_\theta(\bar{X}) = \mathrm{Var}_\theta(\bar{X}) = \frac{\theta(1 - \theta)}{n},$$

and the estimated MSE is

$$\mathrm{MSE}_{\hat{\theta}}(\bar{X}) = \frac{\bar{x}(1 - \bar{x})}{n}.$$

The standard error of the estimate $\bar{x}$ is then given by

$$\mathrm{Sd}_{\hat{\theta}}(\bar{X}) = \sqrt{\frac{\bar{x}(1 - \bar{x})}{n}}.$$

Note how this standard error is quite different from the standard error of $\bar{x}$ in Example
6.3.1. ∎


EXAMPLE 6.3.3 Application of the Bernoulli Model 

A polling organization is asked to estimate the proportion of households in the population
in a specific district who will participate in a proposed recycling program by
separating their garbage into various components. The pollsters decided to take a sample
of $n = 1000$ from the population of approximately 1.5 million households (we will
say more on how to choose this number later).

Each respondent will indicate either yes or no to a question concerning their participation.
Given that the sample size is small relative to the population size, we can
assume that we are sampling from a Bernoulli($\theta$) model where $\theta \in [0, 1]$ is the proportion
of individuals in the population who will respond yes.

After conducting the sample, there were 790 respondents who replied yes and 210
who responded no. Therefore, the MLE of $\theta$ is

$$\bar{x} = \frac{790}{1000} = 0.79,$$

and the standard error of the estimate is

$$\sqrt{\frac{\bar{x}(1 - \bar{x})}{1000}} = \sqrt{\frac{0.79(1 - 0.79)}{1000}} = 0.01288.$$

Notice that it is not entirely clear how we should interpret the value 0.01288. Does
it mean our estimate 0.79 is highly accurate, modestly accurate, or not accurate at all?
We will discuss this further in Section 6.3.2. ∎


EXAMPLE 6.3.4 Location-Scale Normal Model 
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution where $\mu \in R^1$
and $\sigma^2 > 0$ are unknown. The parameter in this model is given by $\theta = (\mu, \sigma^2) \in
\Omega = R^1 \times (0, \infty)$. Suppose that we want to estimate $\mu = \psi(\mu, \sigma^2)$, i.e., just the first
coordinate of the full model parameter.

In Example 6.1.8, we determined that the likelihood function is given by

$$L(\mu, \sigma^2 \mid x_1, \ldots, x_n) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2\right)\exp\left(-\frac{n-1}{2\sigma^2}\,s^2\right).$$

In Example 6.2.6 we showed that the MLE of $\theta$ is

$$\left(\bar{x},\ \frac{n-1}{n}\,s^2\right).$$

Furthermore, from Theorem 4.6.6, the sampling distribution of the MLE is given by
$\bar{X} \sim N(\mu, \sigma^2/n)$ independent of $(n - 1)S^2/\sigma^2 \sim \chi^2(n - 1)$.

The plug-in MLE of $\mu$ is $\bar{x}$. This estimator is unbiased and has

$$\mathrm{MSE}_\theta(\bar{X}) = \mathrm{Var}_\theta(\bar{X}) = \frac{\sigma^2}{n}.$$

Since $\sigma^2$ is unknown we estimate $\mathrm{MSE}_\theta(\bar{X})$ by

$$\mathrm{MSE}_{\hat{\theta}}(\bar{X}) = \frac{\frac{n-1}{n}\,s^2}{n} = \frac{n-1}{n^2}\,s^2 \approx \frac{s^2}{n}.$$

The value $s^2/n$ is commonly used instead of $\mathrm{MSE}_{\hat{\theta}}(\bar{X})$, because (Corollary 4.6.2)

$$E_\theta(S^2) = \sigma^2,$$

i.e., $S^2$ is an unbiased estimator of $\sigma^2$. The quantity $s/\sqrt{n}$ is referred to as the standard
error of the estimate $\bar{x}$. ∎


EXAMPLE 6.3.5 Application of the Location-Scale Normal Model 

In Example 5.5.6, we have a sample of $n = 30$ heights (in inches) of students. We
calculated $\bar{x} = 64.517$ as our estimate of the mean population height $\mu$. In addition, we
obtained the estimate $s = 2.379$ of $\sigma$. Therefore, the standard error of the estimate $\bar{x} =
64.517$ is $s/\sqrt{30} = 2.379/\sqrt{30} = 0.43434$. As in Example 6.3.3, we are faced with
interpreting exactly what this number means in terms of the accuracy of the estimate. ∎


Consistency of Estimators 


Perhaps the most important property that any estimator $T$ of a characteristic $\psi(\theta)$ can
have is that it be consistent. Broadly speaking, this means that as we increase the
amount of data we collect, then the sequence of estimates should converge to the true
value of $\psi(\theta)$. To see why this is a necessary property of any estimation procedure,
consider the finite population sampling context discussed in Section 5.4.1. When the 
sample size is equal to the population size, then of course we have the full information 
and can compute exactly every characteristic of the distribution of any measurement 
defined on the population. So it would be an error to use an estimation procedure for a 
characteristic of interest that did not converge to the true value of the characteristic as 
we increase the sample size. 

Fortunately, we have already developed the necessary mathematics in Chapter 4 to 
define precisely what we mean by consistency. 


Definition 6.3.3 A sequence of estimates $T_1, T_2, \ldots$ is said to be consistent (in
probability) for $\psi(\theta)$ if $T_n \xrightarrow{P} \psi(\theta)$ as $n \to \infty$ for every $\theta \in \Omega$. A sequence of
estimates $T_1, T_2, \ldots$ is said to be consistent (almost surely) for $\psi(\theta)$ if $T_n \xrightarrow{a.s.} \psi(\theta)$
as $n \to \infty$ for every $\theta \in \Omega$.


Notice that Theorem 4.3.1 says that if the sequence is consistent almost surely, then it
is also consistent in probability.

Consider now a sample $(x_1, \ldots, x_n)$ from a model $\{f_\theta : \theta \in \Omega\}$ and let $T_n =
n^{-1}\sum_{i=1}^{n} x_i$ be the $n$th sample average as an estimator of $\psi(\theta) = E_\theta(X)$, which
we presume exists. The weak and strong laws of large numbers immediately give us
the consistency of the sequence $T_1, T_2, \ldots$ for $\psi(\theta)$. We see immediately that this gives
the consistency of some of the estimators discussed in this section. In fact, Theorem
6.5.2 gives the consistency of the MLE in very general circumstances. Furthermore,
the plug-in MLE will also be consistent under weak restrictions on $\psi$. Accordingly, we
can think of maximum likelihood estimation as doing the right thing in a problem at
least from the point of view of consistency.

More generally, we should always restrict our attention to statistical procedures 
that perform correctly as the amount of data increases. Increasing the amount of data 
means that we are acquiring more information and thus reducing our uncertainty so that 
in the limit we know everything. A statistical procedure that was inconsistent would be 
potentially misleading. 
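
Consistency of the sample average can be illustrated by tracking a running mean as the sample grows. This is only a sketch under assumed settings: a Bernoulli model with a made-up true value $\theta = 0.3$, a fixed seed, and a hypothetical helper name.

```python
import random

def running_means(theta, sizes, seed=1):
    """Return {n: sample mean of the first n Bernoulli(theta) draws}
    for each n in `sizes`, using one growing sample."""
    rng = random.Random(seed)
    total, n, means = 0, 0, {}
    for target in sorted(sizes):
        while n < target:
            total += 1 if rng.random() < theta else 0
            n += 1
        means[target] = total / n
    return means

theta = 0.3  # assumed true value, for illustration only
means = running_means(theta, [10, 100, 1000, 10000, 100000])
for n, m in means.items():
    print(n, m, abs(m - theta))
```

The absolute error need not decrease at every step, but it should become small as $n$ grows, as the laws of large numbers indicate.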


6.3.2 | Confidence Intervals 


While the standard error seems like a reasonable quantity for measuring the accuracy
of an estimate of $\psi(\theta)$, its interpretation is not entirely clear at this point. It turns out
that this is intrinsically tied up with the idea of a confidence interval.

Consider the construction of an interval

$$C(s) = (l(s), u(s)),$$

based on the data $s$, that we believe is likely to contain the true value of $\psi(\theta)$. To do
this, we have to specify the lower endpoint $l(s)$ and upper endpoint $u(s)$ for each data
value $s$. How should we do this?

One approach is to specify a probability $\gamma \in [0, 1]$ and then require that the random
interval $C$ have the confidence property, as specified in the following definition.


Definition 6.3.4 An interval $C(s) = (l(s), u(s))$ is a $\gamma$-confidence interval for
$\psi(\theta)$ if $P_\theta(\psi(\theta) \in C(s)) = P_\theta(l(s) \le \psi(\theta) \le u(s)) \ge \gamma$ for every $\theta \in \Omega$.
We refer to $\gamma$ as the confidence level of the interval.


So $C$ is a $\gamma$-confidence interval for $\psi(\theta)$ if, whenever we are sampling from $P_\theta$, the
probability that $\psi(\theta)$ is in the interval is at least equal to $\gamma$. For a given data set, such
an interval either covers $\psi(\theta)$ or it does not. So note that it is not correct to say that
a particular instance of a $\gamma$-confidence region has probability $\gamma$ of containing the true
value of $\psi(\theta)$.

If we choose $\gamma$ to be a value close to 1, then we are highly confident that the
true value of $\psi(\theta)$ is in $C(s)$. Of course, we can always take $C(s) = R^1$ (a very big
interval!), and we are then 100% confident that the interval contains the true value. But
this tells us nothing we did not already know. So the idea is to try to make use of the
information in the data to construct an interval such that we have a high confidence,
say, $\gamma = 0.95$ or $\gamma = 0.99$, that it contains the true value and is not any longer than
necessary. We then interpret the length of the interval as a measure of how accurately
the data allow us to know the true value of $\psi(\theta)$.




z-Confidence Intervals 


Consider the following example, which provides one approach to the construction of 
confidence intervals. 

EXAMPLE 6.3.6 Location Normal Model and z-Confidence Intervals
Suppose we have a sample $(x_1, \ldots, x_n)$ from the $N(\mu, \sigma_0^2)$ model, where $\mu \in R^1$ is
unknown and $\sigma_0^2 > 0$ is known. The likelihood function is as specified in Example
6.3.1. Suppose we want a confidence interval for $\mu$.

The reasoning that underlies the likelihood function leads naturally to the following
restriction for such a region: If $\mu_1 \in C(x_1, \ldots, x_n)$ and

$$L(\mu_2 \mid x_1, \ldots, x_n) \ge L(\mu_1 \mid x_1, \ldots, x_n),$$

then we should also have $\mu_2 \in C(x_1, \ldots, x_n)$. This restriction is implied by the likelihood
because the model and the data support $\mu_2$ at least as well as $\mu_1$. Thus, if we
conclude that $\mu_1$ is a plausible value, so is $\mu_2$.

Therefore, $C(x_1, \ldots, x_n)$ is of the form

$$C(x_1, \ldots, x_n) = \{\mu : L(\mu \mid x_1, \ldots, x_n) \ge k(x_1, \ldots, x_n)\}$$

for some $k(x_1, \ldots, x_n)$, i.e., $C(x_1, \ldots, x_n)$ is a likelihood interval for $\mu$. Then

$$\begin{aligned}
C(x_1, \ldots, x_n) &= \left\{\mu : \exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu)^2\right) \ge k(x_1, \ldots, x_n)\right\} \\
&= \left\{\mu : -\frac{n}{2\sigma_0^2}(\bar{x} - \mu)^2 \ge \ln k(x_1, \ldots, x_n)\right\} \\
&= \left\{\mu : (\bar{x} - \mu)^2 \le -\frac{2\sigma_0^2}{n}\ln k(x_1, \ldots, x_n)\right\} \\
&= \left[\bar{x} - k^*(x_1, \ldots, x_n)\frac{\sigma_0}{\sqrt{n}},\ \bar{x} + k^*(x_1, \ldots, x_n)\frac{\sigma_0}{\sqrt{n}}\right],
\end{aligned}$$

where $k^*(x_1, \ldots, x_n) = \sqrt{-2\ln k(x_1, \ldots, x_n)}$.

We are now left to choose $k$, or equivalently $k^*$, so that the interval $C$ is a
$\gamma$-confidence interval for $\mu$. Perhaps the simplest choice is to try to choose $k^*$ so that
$k^*(x_1, \ldots, x_n)$ is constant and is such that the interval is as short as possible. Because

$$Z = \frac{\bar{X} - \mu}{\sigma_0/\sqrt{n}} \sim N(0, 1), \qquad (6.3.2)$$

we have

$$\begin{aligned}
\gamma \le P_\mu(\mu \in C(x_1, \ldots, x_n)) &= P_\mu\left(\bar{X} - k^*\frac{\sigma_0}{\sqrt{n}} \le \mu \le \bar{X} + k^*\frac{\sigma_0}{\sqrt{n}}\right) \\
&= P_\mu\left(-k^* \le \frac{\bar{X} - \mu}{\sigma_0/\sqrt{n}} \le k^*\right) = P_\mu\left(\left|\frac{\bar{X} - \mu}{\sigma_0/\sqrt{n}}\right| \le k^*\right) \\
&= 1 - 2(1 - \Phi(k^*)) \qquad (6.3.3)
\end{aligned}$$

for every $\mu \in R^1$, where $\Phi$ is the $N(0, 1)$ cumulative distribution function. We have
equality in (6.3.3) whenever

$$\Phi(k^*) = \frac{1 + \gamma}{2},$$

and so $k^* = z_{(1+\gamma)/2}$, where $z_\alpha$ denotes the $\alpha$th quantile of the $N(0, 1)$ distribution.
This is the smallest constant $k^*$ satisfying (6.3.3).

We have shown that the likelihood interval given by

$$\left[\bar{x} - z_{(1+\gamma)/2}\frac{\sigma_0}{\sqrt{n}},\ \bar{x} + z_{(1+\gamma)/2}\frac{\sigma_0}{\sqrt{n}}\right] \qquad (6.3.4)$$

is an exact $\gamma$-confidence interval for $\mu$. As these intervals are based on the z-statistic,
given by (6.3.2), they are called z-confidence intervals. For example, if we take $\gamma =
0.95$, then $(1 + \gamma)/2 = 0.975$, and, from a statistical package (or Table D.2 in Appendix D),
we obtain $z_{0.975} = 1.96$. Therefore, in repeated sampling, 95% of the intervals
of the form

$$\left[\bar{x} - 1.96\frac{\sigma_0}{\sqrt{n}},\ \bar{x} + 1.96\frac{\sigma_0}{\sqrt{n}}\right]$$

will contain the true value of $\mu$.

This is illustrated in Figure 6.3.3. Here we have plotted the upper and lower endpoints
of the 0.95-confidence intervals for $\mu$ for each of $N = 25$ samples of size
$n = 10$ generated from an $N(0, 1)$ distribution. The theory says that when $N$ is large,
approximately 95% of these intervals will contain the true value $\mu = 0$. In the plot,
coverage means that the lower endpoint (denoted by •) must be below the horizontal
line at 0 and that the upper endpoint (denoted by ○) must be above this horizontal line.
We see that only the fourth and twenty-third confidence intervals do not contain 0, so
23/25 = 92% of the intervals contain 0. As $N \to \infty$, this proportion will converge to
0.95.


Figure 6.3.3: Plot of 0.95-confidence intervals for $\mu = 0$ (lower endpoint = •, upper endpoint
= ○) for $N = 25$ samples of size $n = 10$ from an $N(0, 1)$ distribution.


Notice that interval (6.3.4) is symmetrical about $\bar{x}$. Accordingly, the half-length of
this interval,

$$z_{(1+\gamma)/2}\frac{\sigma_0}{\sqrt{n}},$$

is a measure of the accuracy of the estimate $\bar{x}$. The half-length is often referred to as
the margin of error.

From the margin of error, we now see how to interpret the standard error; the standard
error controls the lengths of the confidence intervals for the unknown $\mu$. For example,
we know that with probability approximately equal to 1 (actually $\gamma = 0.9974$),
the interval $[\bar{x} \pm 3\sigma_0/\sqrt{n}]$ contains the true value of $\mu$. ∎
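
The repeated-sampling interpretation of a z-confidence interval can be explored with a short simulation. The sketch below is illustrative rather than a reproduction of the figure: it assumes samples of size $n = 10$ from an $N(0, 1)$ distribution with $\sigma_0 = 1$ treated as known, uses many more samples than the 25 shown in the plot, and the function names and seed are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def z_interval(xs, sigma0, gamma=0.95):
    """z-confidence interval for mu when the standard deviation sigma0 is known."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)   # the (1 + gamma)/2 quantile of N(0, 1)
    xbar = fmean(xs)
    margin = z * sigma0 / len(xs) ** 0.5        # margin of error
    return xbar - margin, xbar + margin

def coverage(mu, sigma0, n, n_samples, gamma=0.95, seed=2):
    """Proportion of simulated samples whose interval contains the true mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        xs = [rng.gauss(mu, sigma0) for _ in range(n)]
        lo, hi = z_interval(xs, sigma0, gamma)
        hits += lo <= mu <= hi
    return hits / n_samples

z975 = NormalDist().inv_cdf(0.975)
cov = coverage(mu=0.0, sigma0=1.0, n=10, n_samples=20000)
print("z_0.975:", round(z975, 2))
print("estimated coverage:", cov)
```

With a large number of simulated samples, the estimated coverage should be close to the nominal level 0.95.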


Example 6.3.6 serves as a standard example for how confidence intervals are often 
constructed in statistics. Basically, the idea is that we take an estimate and then look at 
the intervals formed by taking symmetrical intervals around the estimate via multiples 
of its standard error. We illustrate this via some further examples. 


EXAMPLE 6.3.7 Bernoulli Model 

Suppose that $(x_1, \ldots, x_n)$ is a sample from a Bernoulli($\theta$) distribution where $\theta \in [0, 1]$
is unknown and we want a $\gamma$-confidence interval for $\theta$. Following Example 6.3.2, we
have that the MLE is $\bar{x}$ (see Exercise 6.2.2) and the standard error of this estimate is

$$\sqrt{\frac{\bar{x}(1 - \bar{x})}{n}}.$$

For this model, likelihood intervals take the form

$$C(x_1, \ldots, x_n) = \{\theta : \theta^{n\bar{x}}(1 - \theta)^{n(1 - \bar{x})} \ge k(x_1, \ldots, x_n)\}$$

for some $k(x_1, \ldots, x_n)$. Again restricting to constant $k$, we see that to determine these
intervals, we have to find the roots of equations of the form

$$\theta^{n\bar{x}}(1 - \theta)^{n(1 - \bar{x})} = k(x_1, \ldots, x_n).$$

While numerical root-finding methods can handle this quite easily, this approach is not
very tractable when we want to find the appropriate value of $k(x_1, \ldots, x_n)$ to give a
$\gamma$-confidence interval.

To avoid these computational complexities, it is common to use an approximate
likelihood and confidence interval based on the central limit theorem. The central limit
theorem (see Example 4.4.9) implies that

$$\frac{\sqrt{n}(\bar{X} - \theta)}{\sqrt{\theta(1 - \theta)}} \xrightarrow{D} N(0, 1)$$

as $n \to \infty$. Furthermore, a generalization of the central limit theorem (see Section
4.4.2), shows that

$$Z = \frac{\sqrt{n}(\bar{X} - \theta)}{\sqrt{\bar{X}(1 - \bar{X})}} \xrightarrow{D} N(0, 1).$$

Therefore, we have

$$\begin{aligned}
\gamma &= \lim_{n \to \infty} P_\theta\left(-z_{(1+\gamma)/2} \le \frac{\sqrt{n}(\bar{X} - \theta)}{\sqrt{\bar{X}(1 - \bar{X})}} \le z_{(1+\gamma)/2}\right) \\
&= \lim_{n \to \infty} P_\theta\left(\bar{X} - z_{(1+\gamma)/2}\sqrt{\frac{\bar{X}(1 - \bar{X})}{n}} \le \theta \le \bar{X} + z_{(1+\gamma)/2}\sqrt{\frac{\bar{X}(1 - \bar{X})}{n}}\right),
\end{aligned}$$

and

$$\left[\bar{x} - z_{(1+\gamma)/2}\sqrt{\frac{\bar{x}(1 - \bar{x})}{n}},\ \bar{x} + z_{(1+\gamma)/2}\sqrt{\frac{\bar{x}(1 - \bar{x})}{n}}\right] \qquad (6.3.5)$$

is an approximate $\gamma$-confidence interval for $\theta$. Notice that this takes the same form as
the interval in Example 6.3.6, except that the standard error has changed.

For example, if we want an approximate 0.95-confidence interval for $\theta$ in Example
6.3.3, then based on the observed $\bar{x} = 0.79$, we obtain

$$\left[0.79 \pm 1.96\sqrt{\frac{0.79(1 - 0.79)}{1000}}\right] = [0.76475, 0.81525].$$

The margin of error in this case equals 0.025245, so we can conclude that we know
the true proportion with reasonable accuracy based on our sample. Actually, it may be
that this accuracy is not good enough or is even too good. We will discuss methods for
ensuring that we achieve appropriate accuracy in Section 6.3.5.

The $\gamma$-confidence interval derived here for $\theta$ is one of many that you will see recommended
in the literature. Recall that (6.3.5) is only an approximate $\gamma$-confidence
interval for $\theta$, and $n$ may need to be large for the approximation to be accurate. In
other words, the true confidence level for (6.3.5) will not equal $\gamma$ and could be far from
that value if $n$ is too small. In particular, if the true $\theta$ is near 0 or 1, then $n$ may need
to be very large. In an actual application, we usually have some idea of a small range
of possible values a population proportion $\theta$ can take. Accordingly, it is advisable to
carry out some simulation studies to assess whether or not (6.3.5) is going to provide
an acceptable approximation for $\theta$ in that range (see Computer Exercise 6.3.21). ∎
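
The following sketch computes the approximate interval for the polling data of Example 6.3.3 and then runs the kind of simulation study suggested above. The function names, replication counts, seed, and the second scenario ($\theta = 0.02$ with $n = 30$) are assumptions chosen only to show how the approximation can break down when $\theta$ is near 0 and $n$ is small.

```python
import random
from statistics import NormalDist

def approx_interval(successes, n, gamma=0.95):
    """Approximate gamma-confidence interval for a Bernoulli proportion,
    based on the central limit theorem; also returns the standard error."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    xbar = successes / n
    se = (xbar * (1 - xbar) / n) ** 0.5
    return xbar - z * se, xbar + z * se, se

def simulated_coverage(theta, n, reps, gamma=0.95, seed=3):
    """Proportion of simulated samples whose interval contains theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        k = sum(rng.random() < theta for _ in range(n))
        lo, hi, _ = approx_interval(k, n, gamma)
        hits += lo <= theta <= hi
    return hits / reps

lo, hi, se = approx_interval(790, 1000)
print("standard error:", se)
print("approximate 0.95-interval:", lo, hi)

cov_large = simulated_coverage(0.79, 1000, reps=2000)  # setting like Example 6.3.3
cov_small = simulated_coverage(0.02, 30, reps=5000)    # theta near 0, small n
print("coverage, theta=0.79, n=1000:", cov_large)
print("coverage, theta=0.02, n=30:", cov_small)
```

In the second scenario many samples contain no successes at all, so the interval collapses to the single point 0 and the realized coverage falls well below the nominal level.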


t-Confidence Intervals 


Now we consider confidence intervals for $\mu$ in an $N(\mu, \sigma^2)$ model when we drop the
unrealistic assumption that we know the population variance.


EXAMPLE 6.3.8 Location-Scale Normal Model and t-Confidence Intervals 

Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where $\mu \in R^1$
and $\sigma > 0$ are unknown. The parameter in this model is given by
$\theta = (\mu, \sigma^2) \in \Omega = R^1 \times (0, \infty)$. Suppose we want to form confidence
intervals for $\mu = \psi(\mu, \sigma^2)$.

The likelihood function in this case is a function of two variables, $\mu$ and $\sigma^2$, and so
the reasoning we employed in Example 6.3.6 to determine the form of the confidence
interval is not directly applicable. In Example 6.3.4, we developed $s/\sqrt{n}$ as the standard
error of the estimate $\bar{x}$ of $\mu$. Accordingly, we restrict our attention to confidence
intervals of the form

$$C(x_1, \ldots, x_n) = \left[\bar{x} - k\frac{s}{\sqrt{n}},\ \bar{x} + k\frac{s}{\sqrt{n}}\right]$$

for some constant $k$.




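We then have

$$
\begin{aligned}
P_{(\mu,\sigma^2)}\left(\bar{X} - k\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + k\frac{S}{\sqrt{n}}\right)
&= P_{(\mu,\sigma^2)}\left(-k \le \frac{\bar{X}-\mu}{S/\sqrt{n}} \le k\right) \\
&= P_{(\mu,\sigma^2)}\left(\left|\frac{\bar{X}-\mu}{S/\sqrt{n}}\right| \le k\right) = 1 - 2\,(1 - G(k;\, n-1)),
\end{aligned}
$$

where $G(\cdot\,;\, n-1)$ is the distribution function of

$$T = \frac{\bar{X}-\mu}{S/\sqrt{n}}. \qquad (6.3.6)$$

Now, by Theorem 4.6.6,

$$\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

independent of $(n-1)S^2/\sigma^2 \sim \chi^2(n-1)$. Therefore, by Definition 4.6.2,

$$T = \left(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\right) \Big/ \sqrt{\frac{(n-1)S^2}{(n-1)\,\sigma^2}} = \frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t(n-1).$$

So if we take

$$k = t_{(1+\gamma)/2}(n-1),$$

where $t_\alpha(\lambda)$ is the $\alpha$th quantile of the $t(\lambda)$ distribution, then

$$\left[\bar{x} - t_{(1+\gamma)/2}(n-1)\frac{s}{\sqrt{n}},\ \bar{x} + t_{(1+\gamma)/2}(n-1)\frac{s}{\sqrt{n}}\right]$$

is an exact $\gamma$-confidence interval for $\mu$. The quantiles of the $t$ distributions are available
from a statistical package (or Table D.4 in Appendix D). As these intervals are based
on the t-statistic, given by (6.3.6), they are called t-confidence intervals.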

These confidence intervals for $\mu$ tend to be longer than those obtained in Example
6.3.6, and this reflects the greater uncertainty due to $\sigma$ being unknown. When $n = 5$,
then it can be shown that $\bar{x} \pm 3s/\sqrt{n}$ is a 0.96-confidence interval, because
$2G(3;\, 4) - 1 \approx 0.960$. When we replace $s$ by the true value of $\sigma$, then
$\bar{x} \pm 3\sigma/\sqrt{n}$ is a 0.9973-confidence interval, because $2\Phi(3) - 1 \approx 0.9973$.

As already noted, the intervals $\bar{x} \pm ks/\sqrt{n}$ are not likelihood intervals for $\mu$. So the
justification for using these must be a little different from that given in Example 6.3.6.
In fact, the likelihood is defined for the full parameter $\theta = (\mu, \sigma^2)$, and it is not entirely
clear how to extract inferences from it when our interest is in a marginal parameter like
$\mu$. There are a number of different attempts at resolving this issue. Here, however,
we rely on the intuitive reasonableness of these intervals. In Chapter 7, we will see
that these intervals also arise from another approach to inference, which reinforces our
belief that the use of these intervals is appropriate.

In Example 6.3.5, we have a sample of $n = 30$ heights (in inches) of students. We
calculated $\bar{x} = 64.517$ as our estimate of $\mu$ with standard error $s/\sqrt{30} = 0.43434$.
Using software (or Table D.4), we obtain $t_{0.975}(29) = 2.0452$. So a 0.95-confidence
interval for $\mu$ is given by

$$[64.517 \pm 2.0452\,(0.43434)] = [63.629,\ 65.405].$$




The margin of error is 0.888, so we are very confident that the estimate $\bar{x} = 64.517$ is
within an inch of the true mean height. ■
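When no statistical package or table is at hand, the $t$ quantiles can be approximated numerically. The sketch below is one possible way to do this, not the method of any particular package: it integrates the $t(\lambda)$ density with Simpson's rule and inverts the distribution function by bisection. The names `t_pdf`, `t_cdf`, `t_quantile`, and `t_interval` are hypothetical helpers introduced for illustration.

```python
import math


def t_pdf(x, df):
    """Density of the t(df) distribution."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """G(x; df), via Simpson's rule on [0, |x|] and symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_quantile(p, df):
    """t_p(df) for 0.5 <= p < 1, by bisection on the distribution function."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_interval(xbar, s, n, gamma=0.95):
    """Exact gamma-confidence interval xbar +/- t_{(1+gamma)/2}(n-1) * s/sqrt(n)."""
    half = t_quantile((1 + gamma) / 2, n - 1) * s / math.sqrt(n)
    return xbar - half, xbar + half


if __name__ == "__main__":
    # Heights data: n = 30, xbar = 64.517, standard error s/sqrt(30) = 0.43434.
    print(t_interval(64.517, 0.43434 * math.sqrt(30), 30))
    # Confidence level of xbar +/- 3 s/sqrt(n) when n = 5, i.e., 2 G(3; 4) - 1.
    print(2 * t_cdf(3.0, 4) - 1)
```

The same `t_cdf` helper gives the confidence level of any interval $\bar{x} \pm ks/\sqrt{n}$ as $2G(k;\, n-1) - 1$.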


6.3.3 Testing Hypotheses and P-Values


As discussed in Section 5.5.3, another class of inference procedures is concerned with
what we call hypothesis assessment. Suppose there is a theory, conjecture, or hypothesis
that specifies a value for a characteristic of interest $\psi(\theta)$, say $\psi(\theta) = \psi_0$. Often
this hypothesis is written $H_0 : \psi(\theta) = \psi_0$ and is referred to as the null hypothesis.

The word null is used because, as we will see in Chapter 10, the value specified
in $H_0$ is often associated with a treatment having no effect. For example, if we want
to assess whether or not a proposed new drug does a better job of treating a particular
condition than a standard treatment does, the null hypothesis will often be equivalent
to the new drug providing no improvement. Of course, we have to show how this can
be expressed in terms of some characteristic $\psi(\theta)$ of an unknown distribution, and we
will do so in Chapter 10.

The statistician is then charged with assessing whether or not the observed $s$ is in accord
with this hypothesis. So we wish to assess the evidence in $s$ for $\psi(\theta) = \psi_0$ being
true. A statistical procedure that does this can be referred to as a hypothesis assessment,
a test of significance, or a test of hypothesis. Such a procedure involves measuring how
surprising the observed $s$ is when we assume $H_0$ to be true. It is clear that $s$ is surprising
whenever $s$ lies in a region of low probability for each of the distributions specified by
the null hypothesis, i.e., for each of the distributions in the model for which $\psi(\theta) = \psi_0$
is true. If we decide that the data are surprising under $H_0$, then this is evidence against
$H_0$. This assessment is carried out by calculating a probability, called a P-value, so that
small values of the P-value indicate that $s$ is surprising.

It is important to always remember that while a P-value is a probability, this probability
is a measure of surprise. Small values of the P-value indicate to us that a surprising
event has occurred if the null hypothesis $H_0$ was true. A large P-value is not
evidence that the null hypothesis is true. Moreover, a P-value is not the probability that
the null hypothesis is true. The power of a hypothesis assessment method (see Section
6.3.6) also has a bearing on how we interpret a P-value.


z-Tests 
We now illustrate the computation and use of P-values via several examples. 


EXAMPLE 6.3.9 Location Normal Model and the z-Test 

Suppose we have a sample $(x_1, \ldots, x_n)$ from the $N(\mu, \sigma_0^2)$ model, where $\mu \in R^1$ is
unknown and $\sigma_0^2 > 0$ is known, and we have a theory that specifies a value for the
unknown mean, say, $H_0 : \mu = \mu_0$. Note that, by Corollary 4.6.1, when $H_0$ is true, the
sampling distribution of the MLE is given by $\bar{X} \sim N(\mu_0, \sigma_0^2/n)$.

So one method of assessing whether or not the hypothesis $H_0$ makes sense is to
compare the observed value $\bar{x}$ with this distribution. If $\bar{x}$ is in a region of low probability
for the $N(\mu_0, \sigma_0^2/n)$ distribution, then this is evidence that $H_0$ is false. Because the
density of the $N(\mu_0, \sigma_0^2/n)$ distribution is unimodal, the regions of low probability for
this distribution occur in its tails. The farther out in the tails $\bar{x}$ lies, the more surprising
this will be when $H_0$ is true, and thus the more evidence we will have against $H_0$.

In Figure 6.3.4, we have plotted a density of the MLE together with an observed
value $\bar{x}$ that lies far in the right tail of the distribution. This would clearly be a surprising
value from this distribution.

So we want to measure how far out in the tails of the $N(\mu_0, \sigma_0^2/n)$ distribution the
value $\bar{x}$ is. We can do this by computing the probability of observing a value of $\bar{X}$ as
far, or farther, away from the center of the distribution under $H_0$ as $\bar{x}$. The center of
this distribution is given by $\mu_0$. Because

$$Z = \frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}} \sim N(0, 1) \qquad (6.3.7)$$

under $H_0$, the P-value is then given by

$$
\begin{aligned}
P_{\mu_0}\left(\left|\bar{X} - \mu_0\right| \ge \left|\bar{x} - \mu_0\right|\right)
&= P_{\mu_0}\left(\left|\frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}}\right| \ge \left|\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right|\right) \\
&= 2\left[1 - \Phi\left(\left|\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right|\right)\right],
\end{aligned}
$$

where $\Phi$ denotes the $N(0, 1)$ distribution function. If the P-value is small, then we
have evidence that $\bar{x}$ is a surprising value because this tells us that $\bar{x}$ is out in a tail of
the $N(\mu_0, \sigma_0^2/n)$ distribution. Because this P-value is based on the statistic $Z$ defined
in (6.3.7), this is referred to as the z-test procedure.


[Figure 6.3.4: Plot of the density of the MLE in Example 6.3.9 when $\mu_0 = 3$, $\sigma_0^2 = 1$, and
$n = 10$ together with the observed value $\bar{x} = 4.2$ (•). Vertical axis: density; horizontal axis: MLE.]


EXAMPLE 6.3.10 Application of the z-Test 
We generated the following sample of n = 10 from an N (26, 4) distribution. 


29.0651 27.3980 23.4346 26.3665 23.4994 
28.6592 25.5546 29.4477 28.0979 25.2850 




Even though we know the true value of $\mu$, let us suppose we do not and test the hypothesis
$H_0 : \mu = 25$. To assess this, we compute (using a statistical package to evaluate
$\Phi$) the P-value

$$2\left[1 - \Phi\left(\left|\frac{26.6808 - 25}{2/\sqrt{10}}\right|\right)\right] = 2\,(1 - \Phi(2.6576)) = 0.0078,$$

which is quite small. For example, if the hypothesis $H_0$ is correct, then, in repeated
sampling, we would see data giving a value of $\bar{x}$ at least as surprising as what we have
observed only 0.78% of the time. So we conclude that we have evidence against $H_0$
being true, which, of course, is appropriate in this case.

If you do not use a statistical package for the evaluation of $\Phi(2.6576)$, then you
will have to use Table D.2 of Appendix D to get an approximation. For example,
rounding 2.6576 to 2.66, Table D.2 gives $\Phi(2.66) = 0.9961$ and the approximate
P-value is $2\,(1 - 0.9961) = 0.0078$. In this case, the approximation is exact to four
decimal places. ■
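The z-test computation for these data can be reproduced with a few lines of code. The following is a minimal sketch using only the Python standard library; the function name `z_test_p_value` is a hypothetical helper, and $\sigma_0 = 2$ is the known standard deviation of the $N(26, 4)$ distribution used to generate the sample.

```python
import math
from statistics import NormalDist, mean


def z_test_p_value(data, mu0, sigma0):
    """Return the z-statistic and the two-sided P-value 2[1 - Phi(|z|)]."""
    n = len(data)
    z = (mean(data) - mu0) / (sigma0 / math.sqrt(n))
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


data = [29.0651, 27.3980, 23.4346, 26.3665, 23.4994,
        28.6592, 25.5546, 29.4477, 28.0979, 25.2850]
z, p = z_test_p_value(data, mu0=25, sigma0=2)
print(round(z, 4), p)
```

The printed P-value agrees with the value quoted above up to rounding in the fourth decimal place.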


EXAMPLE 6.3.11 Bernoulli Model 
Suppose that $(x_1, \ldots, x_n)$ is a sample from a Bernoulli($\theta$) distribution, where
$\theta \in [0, 1]$ is unknown, and we want to test $H_0 : \theta = \theta_0$. As in Example 6.3.7, when $H_0$ is
true, we have

$$Z = \frac{\sqrt{n}\,(\bar{X} - \theta_0)}{\sqrt{\theta_0 (1 - \theta_0)}} \xrightarrow{D} N(0, 1)$$

as $n \to \infty$. So we can test this hypothesis by computing the approximate P-value

$$P\left(|Z| \ge \left|\frac{\sqrt{n}\,(\bar{x} - \theta_0)}{\sqrt{\theta_0 (1 - \theta_0)}}\right|\right) \approx 2\left[1 - \Phi\left(\left|\frac{\sqrt{n}\,(\bar{x} - \theta_0)}{\sqrt{\theta_0 (1 - \theta_0)}}\right|\right)\right]$$

when $n$ is large.

As a specific example, suppose that a psychic claims the ability to predict the value
of a randomly tossed fair coin. To test this, a coin was tossed 100 times and the psychic's
guesses were recorded as successes or failures. A total of 54 successes were
observed.

If the psychic has no predictive ability, then we would expect the successes to occur
randomly, just as heads occur when we toss the coin. Therefore, we want to test the
null hypothesis that the probability $\theta$ of a success occurring is equal to $\theta_0 = 1/2$. This
is equivalent to saying that the psychic has no predictive ability. The MLE is 0.54 and
the approximate P-value is given by

$$2\left[1 - \Phi\left(\left|\frac{\sqrt{100}\,(0.54 - 0.5)}{\sqrt{0.5\,(1 - 0.5)}}\right|\right)\right] = 2\,(1 - \Phi(0.8)) = 2\,(1 - 0.7881) = 0.4238,$$

and we would appear to have no evidence that $H_0$ is false, i.e., no reason to doubt that
the psychic has no predictive ability. ■
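The approximate P-value for the Bernoulli model is equally simple to evaluate in code. The sketch below uses the psychic data (54 successes in 100 tosses, $\theta_0 = 1/2$); the function name `bernoulli_z_test` is a hypothetical helper introduced for illustration.

```python
import math
from statistics import NormalDist


def bernoulli_z_test(successes, n, theta0):
    """Approximate two-sided P-value for H0: theta = theta0 (large n)."""
    xbar = successes / n
    z = math.sqrt(n) * (xbar - theta0) / math.sqrt(theta0 * (1 - theta0))
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


z, p = bernoulli_z_test(54, 100, 0.5)
print(z, p)
```

Because the P-value relies on the normal approximation, it should only be trusted when $n$ is reasonably large.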


Often cutoff values like 0.05 or 0.01 are used to determine whether the results
of a test are significant or not. For example, if the P-value is less than 0.05, then
the results are said to be statistically significant at the 5% level. There is nothing
sacrosanct about the 0.05 level, however, and different values can be used depending on
the application. For example, if the result of concluding that we have evidence against
$H_0$ is that something very expensive or important will take place, then naturally we
might demand that the cutoff value be much smaller than 0.05.


When Is Statistical Significance Practically Significant? 


It is also important to point out here the difference between statistical significance
and practical significance. Consider the situation in Example 6.3.9, when the true
value of $\mu$ is $\mu_1 \ne \mu_0$, but $\mu_1$ is so close to $\mu_0$ that, practically speaking, they are
indistinguishable. By the strong law of large numbers, we have that $\bar{X} \xrightarrow{a.s.} \mu_1$ as
$n \to \infty$ and therefore

$$\left|\frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}}\right| \xrightarrow{a.s.} \infty.$$

This implies that

$$2\left[1 - \Phi\left(\left|\frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}}\right|\right)\right] \xrightarrow{a.s.} 0.$$

We conclude that, if we take a large enough sample size $n$, we will inevitably conclude
that $\mu \ne \mu_0$ because the P-value of the z-test goes to 0. Of course, this is correct
because the hypothesis is false.

In spite of this, we do not want to conclude that just because we have statistical
significance, the difference between the true value and $\mu_0$ is of any practical importance.
If we examine the observed absolute difference $|\bar{x} - \mu_0|$ as an estimate of $|\mu - \mu_0|$,
however, we will not make this mistake. If this absolute difference is smaller than some
threshold $\delta$ that we consider represents a practically significant difference, then even
if the P-value leads us to conclude that difference exists, we might conclude that no
difference of any importance exists. Of course, the value of $\delta$ is application dependent.
For example, in coin tossing, where we are testing $\theta = 1/2$, we might not care if the
coin is slightly unfair, say, $|\theta - \theta_0| \le 0.01$. In testing the abilities of a psychic, as in
Example 6.3.11, however, we might take $\delta$ much lower, as any evidence of psychic powers
would be an astounding finding. The issue of practical significance is something we
should always be aware of when conducting a test of significance.

Hypothesis Assessment via Confidence Intervals 


Another approach to testing hypotheses is via confidence intervals. For example, if we
have a $\gamma$-confidence interval $C(s)$ for $\psi(\theta)$ and $\psi_0 \notin C(s)$, then this seems like clear
evidence against $H_0 : \psi(\theta) = \psi_0$, at least when $\gamma$ is close to 1. It turns out that in
many problems, the approach to testing via confidence intervals is equivalent to using
P-values with a specific cutoff for the P-value to determine statistical significance. We
illustrate this equivalence using the z-test and z-confidence intervals.




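EXAMPLE 6.3.12 An Equivalence Between z-Tests and z-Confidence Intervals
We develop this equivalence by showing that obtaining a P-value less than $1 - \gamma$ for
$H_0 : \mu = \mu_0$ is equivalent to $\mu_0$ not being in a $\gamma$-confidence interval for $\mu$. Observe
that

$$1 - \gamma \le 2\left[1 - \Phi\left(\left|\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right|\right)\right]$$

if and only if

$$\Phi\left(\left|\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right|\right) \le \frac{1 + \gamma}{2}.$$

This is true if and only if

$$\left|\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right| \le z_{(1+\gamma)/2},$$

which holds if and only if

$$\mu_0 \in \left[\bar{x} - z_{(1+\gamma)/2}\frac{\sigma_0}{\sqrt{n}},\ \bar{x} + z_{(1+\gamma)/2}\frac{\sigma_0}{\sqrt{n}}\right].$$

This implies that the $\gamma$-confidence interval for $\mu$ comprises those values $\mu_0$ for which
the P-value for the hypothesis $H_0 : \mu = \mu_0$ is greater than $1 - \gamma$.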

Therefore, the P-value, based on the z-statistic, for the null hypothesis $H_0 : \mu = \mu_0$,
will be smaller than $1 - \gamma$ if and only if $\mu_0$ is not in the $\gamma$-confidence interval
for $\mu$ derived in Example 6.3.6. For example, if we decide that for any P-values less
than $1 - \gamma = 0.05$, we will declare the results statistically significant, then we know
the results will be significant whenever the 0.95-confidence interval for $\mu$ does not
contain $\mu_0$. For the data of Example 6.3.10, a 0.95-confidence interval is given by
$[25.441, 27.920]$. As this interval does not contain $\mu_0 = 25$, we have evidence against
the null hypothesis at the 0.05 level.

We can apply the same reasoning for tests about $\theta$ when we are sampling from a
Bernoulli($\theta$) model. For the data in Example 6.3.11, we obtain the 0.95-confidence
interval

$$\bar{x} \pm z_{0.975}\sqrt{\frac{\bar{x}\,(1 - \bar{x})}{n}} = 0.54 \pm 1.96\sqrt{\frac{0.54\,(1 - 0.54)}{100}} = [0.44231,\ 0.63769],$$

which includes the value $\theta_0 = 0.5$. So we have no evidence against the null hypothesis
of no predictive ability for the psychic at the 0.05 level. ■
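The equivalence between the z-test and the z-confidence interval can be checked numerically. The following sketch, with hypothetical helper names `z_interval` and `z_p_value`, uses $\bar{x} = 26.6808$, $\sigma_0 = 2$, and $n = 10$ from Example 6.3.10 and compares the two decisions for a handful of arbitrarily chosen candidate values $\mu_0$.

```python
import math
from statistics import NormalDist


def z_interval(xbar, sigma0, n, gamma=0.95):
    """gamma-confidence interval xbar +/- z_{(1+gamma)/2} * sigma0/sqrt(n)."""
    half = NormalDist().inv_cdf((1 + gamma) / 2) * sigma0 / math.sqrt(n)
    return xbar - half, xbar + half


def z_p_value(xbar, mu0, sigma0, n):
    """Two-sided z-test P-value for H0: mu = mu0."""
    z = (xbar - mu0) / (sigma0 / math.sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


xbar, sigma0, n, gamma = 26.6808, 2.0, 10, 0.95
lo, hi = z_interval(xbar, sigma0, n, gamma)
candidates = (24.0, 25.0, 25.5, 26.0, 27.9, 28.0)
for mu0 in candidates:
    inside = lo <= mu0 <= hi
    significant = z_p_value(xbar, mu0, sigma0, n) < 1 - gamma
    # Exactly one of the two flags is True for every mu0.
    print(mu0, inside, significant)
```

For every candidate value, the result is significant at level $1 - \gamma$ precisely when $\mu_0$ falls outside the interval.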


t-Tests 


We now consider an example pertaining to the important location-scale normal model. 


EXAMPLE 6.3.13 Location-Scale Normal Model and t-Tests 

Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where $\mu \in R^1$
and $\sigma > 0$ are unknown, and suppose we want to test the null hypothesis $H_0 : \mu = \mu_0$.
In Example 6.3.8, we obtained a $\gamma$-confidence interval for $\mu$. This was based on the
t-statistic given by (6.3.6). So we base our test on this statistic also. In fact, it can
be shown that the test we derive here is equivalent to using the confidence intervals to
assess the hypothesis as described in Example 6.3.12.

As in Example 6.3.8, we can prove that when the null hypothesis is true, then

$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \qquad (6.3.8)$$

is distributed $t(n-1)$. The $t$ distributions are unimodal, with the mode at 0, and the
regions of low probability are given by the tails. So we test, or assess, this hypothesis
by computing the probability of observing a value as far or farther away from 0 as
(6.3.8). Therefore, the P-value is given by

$$P_{(\mu_0,\sigma^2)}\left(|T| \ge \left|\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right|\right) = 2\left[1 - G\left(\left|\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right|;\, n-1\right)\right],$$

where $G(\cdot\,;\, n-1)$ is the distribution function of the $t(n-1)$ distribution. We then
have evidence against $H_0$ whenever this probability is small. This procedure is called
the t-test. Again, it is a good idea to look at the difference $|\bar{x} - \mu_0|$, when we conclude
that $H_0$ is false, to determine whether or not the detected difference is of practical
importance.

Consider now the data in Example 6.3.10 and let us pretend that we do not know $\mu$
or $\sigma^2$. Then we have $\bar{x} = 26.6808$ and $s = \sqrt{4.8620} = 2.2050$, so to test $H_0 : \mu = 25$,
the value of the t-statistic is

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{26.6808 - 25}{2.2050/\sqrt{10}} = 2.4105.$$

From a statistics package (or Table D.4) we obtain $t_{0.975}(9) = 2.2622$. Since
$2.4105 > 2.2622$, we have a statistically significant result at the 5% level and conclude
that we have evidence against $H_0 : \mu = 25$. Using a statistical package, we can determine
the precise value of the P-value to be 0.039 in this case. ■
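The t-test can be carried out directly from the raw data. The sketch below is an illustration only: it approximates the $t(n-1)$ distribution function by Simpson's rule (an assumption made so that the example needs nothing beyond the standard library), and the names `t_cdf` and `t_test` are hypothetical helpers. Because the sample variance is recomputed from the listed data, the statistic may differ from the value quoted above in the third or fourth decimal place.

```python
import math
from statistics import mean, stdev


def t_cdf(x, df, steps=2000):
    """G(x; df): Simpson's rule on [0, |x|] for the t(df) density, plus symmetry."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)

    a = abs(x)
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_test(data, mu0):
    """Return the t-statistic and the two-sided P-value 2[1 - G(|t|; n-1)]."""
    n = len(data)
    t = (mean(data) - mu0) / (stdev(data) / math.sqrt(n))
    return t, 2 * (1 - t_cdf(abs(t), n - 1))


data = [29.0651, 27.3980, 23.4346, 26.3665, 23.4994,
        28.6592, 25.5546, 29.4477, 28.0979, 25.2850]
t, p = t_test(data, mu0=25)
print(t, p)
```

As recommended above, the difference $|\bar{x} - \mu_0|$ should still be inspected before attaching practical importance to a small P-value.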


One-Sided Tests 


All the tests we have discussed so far in this section for a characteristic of interest $\psi(\theta)$
have been two-sided tests. This means that the null hypothesis specified the value of
$\psi(\theta)$ to be a single value $\psi_0$. Sometimes, however, we want to test a null hypothesis
of the form $H_0 : \psi(\theta) \le \psi_0$ or $H_0 : \psi(\theta) \ge \psi_0$. To carry out such tests, we use
the same test statistics as we have developed in the various examples here but compute
the P-value in a way that reflects the one-sided nature of the null. These are known as
one-sided tests. We illustrate a one-sided test using the location normal model.


EXAMPLE 6.3.14 One-Sided Tests 

Suppose we have a sample $(x_1, \ldots, x_n)$ from the $N(\mu, \sigma_0^2)$ model, where $\mu \in R^1$ is
unknown and $\sigma_0^2 > 0$ is known. Suppose further that it is hypothesized that
$H_0 : \mu \le \mu_0$ is true, and we wish to assess this after observing the data.

We will base our test on the z-statistic

$$Z = \frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}} = \frac{\bar{X} - \mu + \mu - \mu_0}{\sigma_0/\sqrt{n}} = \frac{\bar{X} - \mu}{\sigma_0/\sqrt{n}} + \frac{\mu - \mu_0}{\sigma_0/\sqrt{n}}.$$

So $Z$ is the sum of a random variable having an $N(0, 1)$ distribution and the constant
$\sqrt{n}\,(\mu - \mu_0)/\sigma_0$, which implies that

$$Z \sim N\left(\frac{\mu - \mu_0}{\sigma_0/\sqrt{n}},\ 1\right).$$

Note that

$$\frac{\mu - \mu_0}{\sigma_0/\sqrt{n}} \le 0$$

if and only if $H_0$ is true.

This implies that, when the null hypothesis is false, we will tend to see values of
$Z$ in the right tail of the $N(0, 1)$ distribution; when the null hypothesis is true, we will
tend to see values of $Z$ that are reasonable for the $N(0, 1)$ distribution, or in the left tail
of this distribution. Accordingly, to test $H_0$, we compute the P-value

$$P\left(Z > \frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right) = 1 - \Phi\left(\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right),$$

with $Z \sim N(0, 1)$, and conclude that we have evidence against $H_0$ when this is small.

Using the same reasoning, the P-value for the null hypothesis $H_0 : \mu \ge \mu_0$ equals

$$P\left(Z < \frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right) = \Phi\left(\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right).$$

For more discussion of one-sided tests and confidence intervals, see Problems 6.3.25
through 6.3.32. ■
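The two one-sided P-values differ only in which tail of the $N(0, 1)$ distribution is used. As a small illustration, the sketch below evaluates both for the summary values of Example 6.3.10 ($\bar{x} = 26.6808$, $\sigma_0 = 2$, $n = 10$) with $\mu_0 = 25$; applying one-sided hypotheses to those data is an assumption made here purely for illustration, and `one_sided_p_values` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def one_sided_p_values(xbar, mu0, sigma0, n):
    """P-values for H0: mu <= mu0 (right tail) and H0: mu >= mu0 (left tail)."""
    z = (xbar - mu0) / (sigma0 / math.sqrt(n))
    phi = NormalDist().cdf(z)
    return {"H0: mu <= mu0": 1 - phi, "H0: mu >= mu0": phi}


p = one_sided_p_values(26.6808, 25, 2, 10)
print(p)
```

The two values always sum to 1, so at most one of the two one-sided null hypotheses can appear surprising for a given data set.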


6.3.4 Inferences for the Variance


In Sections 6.3.1, 6.3.2, and 6.3.3, we focused on inferences for the unknown mean of a
distribution, e.g., when we are sampling from an $N(\mu, \sigma^2)$ distribution or a Bernoulli($\theta$)
distribution and our interest is in $\mu$ or $\theta$, respectively. In general, location parameters
tend to play a much more important role in a statistical analysis than other characteristics
of a distribution. There are logical reasons for this, discussed in Chapter 10, when
we consider regression models. Sometimes we refer to a parameter such as $\sigma^2$ as a nuisance
parameter because our interest is in $\mu$. Note that the variance of a Bernoulli($\theta$)
distribution is $\theta(1 - \theta)$ so that inferences about $\theta$ are logically inferences about the
variance too, i.e., there are no nuisance parameters.

But sometimes we are primarily interested in making inferences about $\sigma^2$ in the
$N(\mu, \sigma^2)$ distribution when it is unknown. For example, suppose that previous experience
with a system under study indicates that the true value of the variance is
well-approximated by $\sigma_0^2$, i.e., the true value does not differ from $\sigma_0^2$ by an amount having
any practical significance. Now based on the new sample, we may want to assess the
hypothesis $H_0 : \sigma^2 = \sigma_0^2$, i.e., we wonder whether or not the basic variability in the
process has changed.

The discussion in Section 6.3.1 led to consideration of the standard error $s/\sqrt{n}$ as
an estimate of the standard deviation $\sigma/\sqrt{n}$ of $\bar{x}$. In many ways $s^2$ seems like a very
natural estimator of $\sigma^2$, even when we aren't sampling from a normal distribution.

The following example develops confidence intervals and P-values for $\sigma^2$.


EXAMPLE 6.3.15 Location-Scale Normal Model and Inferences for the Variance 
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where $\mu \in R^1$
and $\sigma > 0$ are unknown, and we want to make inferences about the population variance
$\sigma^2$. The plug-in MLE is given by $(n-1)s^2/n$, which is the average of the squared
deviations of the data values from $\bar{x}$. Often $s^2$ is recommended as the estimate because
it has the unbiasedness property, and we will use this here. An expression can be
determined for the standard error of this estimate, but, as it is somewhat complicated,
we will not pursue this further here.

We can form a $\gamma$-confidence interval for $\sigma^2$ using $(n-1)S^2/\sigma^2 \sim \chi^2(n-1)$
(Theorem 4.6.6). There are a number of possibilities for this interval, but one is to note
that, letting $\chi^2_\alpha(\lambda)$ denote the $\alpha$th quantile for the $\chi^2(\lambda)$ distribution, then

$$
\begin{aligned}
\gamma &= P_{(\mu,\sigma^2)}\left(\chi^2_{(1-\gamma)/2}(n-1) \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{(1+\gamma)/2}(n-1)\right) \\
&= P_{(\mu,\sigma^2)}\left(\frac{(n-1)S^2}{\chi^2_{(1+\gamma)/2}(n-1)} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{(1-\gamma)/2}(n-1)}\right)
\end{aligned}
$$

for every $(\mu, \sigma^2) \in R^1 \times (0, \infty)$. So

$$\left[\frac{(n-1)s^2}{\chi^2_{(1+\gamma)/2}(n-1)},\ \frac{(n-1)s^2}{\chi^2_{(1-\gamma)/2}(n-1)}\right]$$

is an exact $\gamma$-confidence interval for $\sigma^2$. To test a hypothesis such as $H_0 : \sigma^2 = \sigma_0^2$ at
the $1 - \gamma$ level, we need only see whether or not $\sigma_0^2$ is in the interval. If $\gamma^*$ is the smallest
value of $\gamma$ such that $\sigma_0^2$ is in the interval, then $1 - \gamma^*$ is the P-value for this hypothesis
assessment procedure.

For the data in Example 6.3.10, let us pretend that we do not know that $\sigma^2 = 4$.
Here, $n = 10$ and $s^2 = 4.8620$. From a statistics package (or Table D.3 in Appendix
D) we obtain $\chi^2_{0.025}(9) = 2.700$, $\chi^2_{0.975}(9) = 19.023$. So a 0.95-confidence interval
for $\sigma^2$ is given by

$$\left[\frac{(n-1)s^2}{\chi^2_{0.975}(9)},\ \frac{(n-1)s^2}{\chi^2_{0.025}(9)}\right] = \left[\frac{9\,(4.8620)}{19.023},\ \frac{9\,(4.8620)}{2.700}\right] = [2.3003,\ 16.207].$$

The length of the interval indicates that there is a reasonable degree of uncertainty
concerning the true value of $\sigma^2$. We see, however, that a test of $H_0 : \sigma^2 = 4$ would
not reject this hypothesis at the 5% level because the value 4 is in the 0.95-confidence
interval. ■
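The chi-squared quantiles needed for this interval can also be approximated without a table. The sketch below is one possible approach, assuming at least 2 degrees of freedom so that the density is bounded at 0: it integrates the $\chi^2(\lambda)$ density by Simpson's rule and inverts by bisection. All function names are hypothetical helpers, and the result agrees with table values only up to small numerical and rounding differences.

```python
import math


def chi2_pdf(x, df):
    """Density of the chi-squared(df) distribution (df >= 2 assumed)."""
    return x ** (df / 2 - 1) * math.exp(-x / 2) / (2 ** (df / 2) * math.gamma(df / 2))


def chi2_cdf(x, df, steps=4000):
    """Distribution function via Simpson's rule on [0, x]."""
    h = x / steps
    total = chi2_pdf(0.0, df) + chi2_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chi2_pdf(i * h, df)
    return total * h / 3


def chi2_quantile(p, df):
    """The p-th quantile, found by bisection on the distribution function."""
    lo, hi = 0.0, float(df)
    while chi2_cdf(hi, df) < p:
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def variance_interval(s2, n, gamma=0.95):
    """Exact gamma-confidence interval for sigma^2 in the normal model."""
    df = n - 1
    lower = df * s2 / chi2_quantile((1 + gamma) / 2, df)
    upper = df * s2 / chi2_quantile((1 - gamma) / 2, df)
    return lower, upper


if __name__ == "__main__":
    print(variance_interval(4.8620, 10))
```

Checking whether a hypothesized $\sigma_0^2$ lies inside the returned interval gives the corresponding test at level $1 - \gamma$.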




6.3.5 Sample-Size Calculations: Confidence Intervals


Quite often a statistician is asked to determine the sample size $n$ to ensure that with
a very high probability the results of a statistical analysis will yield definitive results.
For example, suppose we are going to take a sample of size $n$ from a population $\Pi$ and
want to estimate the population mean $\mu$ so that the estimate is within 0.5 of the true
mean with probability at least 0.95. This means that we want the half-length, or margin
of error, of the 0.95-confidence interval for the mean to be guaranteed to be less than
0.5.

We consider such problems in the following examples. Note that in general, sample- 
size calculations are the domain of experimental design, which we will discuss more 
extensively in Chapter 10. 

First, we consider the problem of selecting the sample size to ensure that a confi- 
dence interval is shorter than some prescribed value. 


EXAMPLE 6.3.16 The Length of a Confidence Interval for a Mean 

Suppose we are in the situation described in Example 6.3.6, in which we have a sample
$(x_1, \ldots, x_n)$ from the $N(\mu, \sigma_0^2)$ model, with $\mu \in R^1$ unknown and $\sigma_0^2 > 0$ known.
Further suppose that the statistician is asked to determine $n$ so that the margin of error
for a $\gamma$-confidence interval for the population mean $\mu$ is no greater than a prescribed
value $\delta > 0$. This entails that $n$ be chosen so that

$$z_{(1+\gamma)/2}\frac{\sigma_0}{\sqrt{n}} \le \delta$$

or, equivalently, so that

$$n \ge \sigma_0^2\left(\frac{z_{(1+\gamma)/2}}{\delta}\right)^2.$$

For example, if $\sigma_0^2 = 10$, $\gamma = 0.95$, and $\delta = 0.5$, then the smallest possible value for
$n$ is 154.

Now consider the situation described in Example 6.3.8, in which we have a sample
$(x_1, \ldots, x_n)$ from the $N(\mu, \sigma^2)$ model with $\mu \in R^1$ and $\sigma^2 > 0$ both unknown. In this
case, we want $n$ so that

$$t_{(1+\gamma)/2}(n-1)\frac{s}{\sqrt{n}} \le \delta,$$

which entails

$$n \ge s^2\left(\frac{t_{(1+\gamma)/2}(n-1)}{\delta}\right)^2.$$

But note this also depends on the unobserved value of $s$, so we cannot determine an
appropriate value of $n$.

Often, however, we can determine an upper bound on the population standard deviation,
say, $\sigma \le b$. For example, suppose we are measuring human heights in centimeters.
Then we have a pretty good idea of upper and lower bounds on the possible
heights we will actually obtain. Therefore, with the normality assumption, the interval
given by the population mean, plus or minus three standard deviations, must be contained
within the interval given by the upper and lower bounds. So dividing the length
of this interval by 6 gives a plausible upper bound $b$ for the value of $\sigma$. In any case,
when we have such an upper bound, we can expect that $s \le b$, at least if we choose $b$
conservatively. Therefore, we take $n$ to satisfy

$$n \ge b^2\left(\frac{t_{(1+\gamma)/2}(n-1)}{\delta}\right)^2.$$

Note that we need to evaluate $t_{(1+\gamma)/2}(n-1)$ for each $n$ as well. It is wise to be fairly
conservative in our choice of $n$ in this case, i.e., do not choose the smallest possible
value. ■


EXAMPLE 6.3.17 The Length of a Confidence Interval for a Proportion 

Suppose we are in the situation described in Example 6.3.2, in which we have a sample
$(x_1, \ldots, x_n)$ from the Bernoulli($\theta$) model and $\theta \in [0, 1]$ is unknown. The statistician
is required to specify the sample size $n$ so that the margin of error of a $\gamma$-confidence
interval for $\theta$ is no greater than a prescribed value $\delta$. So, from Example 6.3.7, we want
$n$ to satisfy

$$z_{(1+\gamma)/2}\sqrt{\frac{\bar{x}\,(1 - \bar{x})}{n}} \le \delta, \qquad (6.3.9)$$

and this entails

$$n \ge \bar{x}\,(1 - \bar{x})\left(\frac{z_{(1+\gamma)/2}}{\delta}\right)^2.$$

Because this also depends on the unobserved $\bar{x}$, we cannot determine $n$. Note, however,
that $0 \le \bar{x}\,(1 - \bar{x}) \le 1/4$ for every $\bar{x}$ (plot this function) and that this upper bound is
achieved when $\bar{x} = 1/2$. Therefore, if we determine $n$ so that

$$n \ge \frac{1}{4}\left(\frac{z_{(1+\gamma)/2}}{\delta}\right)^2,$$

then we know that (6.3.9) is satisfied. For example, if $\gamma = 0.95$, $\delta = 0.1$, the smallest
possible value of $n$ is 97; if $\gamma = 0.95$, $\delta = 0.01$, the smallest possible value of $n$ is
9604. ■
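The sample-size bounds for the known-variance normal model (Example 6.3.16) and the Bernoulli model (Example 6.3.17) are simple enough to evaluate directly. The following is a minimal sketch; the function names `n_for_mean` and `n_for_proportion` are hypothetical, and the exact normal quantile is used in place of the rounded value 1.96.

```python
import math
from statistics import NormalDist


def n_for_mean(sigma0_sq, delta, gamma=0.95):
    """Smallest n with z_{(1+gamma)/2} * sigma0 / sqrt(n) <= delta."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    return math.ceil(sigma0_sq * (z / delta) ** 2)


def n_for_proportion(delta, gamma=0.95):
    """Smallest n guaranteeing margin of error <= delta, using xbar(1 - xbar) <= 1/4."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    return math.ceil(0.25 * (z / delta) ** 2)


if __name__ == "__main__":
    print(n_for_mean(10, 0.5))        # sigma0^2 = 10, delta = 0.5
    print(n_for_proportion(0.1))      # delta = 0.1
    print(n_for_proportion(0.01))     # delta = 0.01
```

Halving $\delta$ roughly quadruples the required $n$ in both models, since $n$ grows like $1/\delta^2$.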


6.3.6 Sample-Size Calculations: Power


Suppose the purpose of a study is to assess a specific hypothesis $H_0 : \psi(\theta) = \psi_0$ and
it has been decided that the results will be declared statistically significant whenever
the P-value is less than $\alpha$. Suppose that the statistician is asked to choose $n$, so that
the P-value obtained is smaller than $\alpha$, with probability at least $\beta_0$, at some specific
$\theta_1$ such that $\psi(\theta_1) \ne \psi_0$. The probability that the P-value is less than $\alpha$ for a specific
value of $\theta$ is called the power of the test at $\theta$. We will denote this by $\beta(\theta)$ and call $\beta$
the power function of the test. The notation $\beta$ is not really complete, as it suppresses
the dependence of $\beta$ on $\psi$, $\psi_0$, $\alpha$, $n$, and the test procedure, but we will assume that
these are clear in a particular context. The problem the statistician is presented with
can then be stated as: Find $n$ so that $\beta(\theta_1) \ge \beta_0$.

The power function of a test is a measure of the sensitivity of the test to detect
departures from the null hypothesis. We choose $\alpha$ small ($\alpha = 0.05$, $0.01$, etc.) so that
we do not erroneously declare that we have evidence against the null hypothesis when
the null hypothesis is in fact true. When $\psi(\theta) \ne \psi_0$, then $\beta(\theta)$ is the probability that
the test does the right thing and detects that $H_0$ is false.

For any test procedure, it is a good idea to examine its power function, perhaps
for several choices of $\alpha$, to see how good the test is at detecting departures. For it
can happen that we do not find any evidence against a null hypothesis when it is false
because the sample size is too small. In such a case, the power will be small at $\theta$ values
that represent practically significant departures from $H_0$. To avoid this problem, we
should always choose a value $\psi_1$ that represents a practically significant departure from
$\psi_0$ and then determine $n$ so that we reject $H_0$ with high probability when $\psi(\theta) = \psi_1$.

We consider the computation and use of the power function in several examples. 


EXAMPLE 6.3.18 The Power Function in the Location Normal Model 
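For the two-sided z-test in Example 6.3.9, we have

$$
\begin{aligned}
\beta(\mu) &= P_\mu\left(2\left[1 - \Phi\left(\left|\frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}}\right|\right)\right] < \alpha\right)
= P_\mu\left(\left|\frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}}\right| > z_{1-\alpha/2}\right) \\
&= P_\mu\left(\frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}} > z_{1-\alpha/2}\right) + P_\mu\left(\frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}} < -z_{1-\alpha/2}\right) \\
&= P_\mu\left(\frac{\bar{X} - \mu}{\sigma_0/\sqrt{n}} > \frac{\mu_0 - \mu}{\sigma_0/\sqrt{n}} + z_{1-\alpha/2}\right) + P_\mu\left(\frac{\bar{X} - \mu}{\sigma_0/\sqrt{n}} < \frac{\mu_0 - \mu}{\sigma_0/\sqrt{n}} - z_{1-\alpha/2}\right) \\
&= 1 - \Phi\left(\frac{\mu_0 - \mu}{\sigma_0/\sqrt{n}} + z_{1-\alpha/2}\right) + \Phi\left(\frac{\mu_0 - \mu}{\sigma_0/\sqrt{n}} - z_{1-\alpha/2}\right). \qquad (6.3.10)
\end{aligned}
$$

Notice that

$$\beta(\mu) = \beta(\mu_0 + (\mu - \mu_0)) = \beta(\mu_0 - (\mu - \mu_0)),$$

so $\beta$ is symmetric about $\mu_0$ (put $\delta = \mu - \mu_0$ and $\mu = \mu_0 + \delta$ in the expression for
$\beta(\mu)$ and we get the same value).

Differentiating (6.3.10) with respect to $\sqrt{n}$, we obtain

$$\left[\varphi\left(\frac{\mu_0 - \mu}{\sigma_0/\sqrt{n}} - z_{1-\alpha/2}\right) - \varphi\left(\frac{\mu_0 - \mu}{\sigma_0/\sqrt{n}} + z_{1-\alpha/2}\right)\right]\frac{\mu_0 - \mu}{\sigma_0}, \qquad (6.3.11)$$

where $\varphi$ is the density of the $N(0, 1)$ distribution. We can establish that (6.3.11) is
always nonnegative (see Challenge 6.3.34). This implies that $\beta(\mu)$ is increasing in
$n$, so we need only solve $\beta(\mu) = \beta_0$ for $n$ (the solution may not be an integer) to
determine a suitable sample size (all larger values of $n$ will give a larger power).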

For example, when $\sigma_0 = 1$, $\alpha = 0.05$, $\beta_0 = 0.99$, and $\mu_1 = \mu_0 + 0.1$, we must
find $n$ satisfying

$$1 - \Phi(\sqrt{n}\,(0.1) + 1.96) + \Phi(\sqrt{n}\,(0.1) - 1.96) = 0.99. \qquad (6.3.12)$$

(Note that the symmetry of $\beta$ about $\mu_0$ means we will get the same answer if we use
$\mu_0 - 0.1$ here instead of $\mu_0 + 0.1$.) Tabulating (6.3.12) as a function of $n$ using a
statistical package determines that $n = 1838$ is the smallest value achieving the required
bound. (The value $n = 785$ is the smallest that achieves a power of 0.80 rather than 0.99,
since $\Phi(\sqrt{785}\,(0.1) - 1.96) \approx 0.80$.)
Also observe that the derivative of (6.3.10) with respect to $\mu$ is given by

$$\left[\varphi\left(\frac{\mu_0 - \mu}{\sigma_0/\sqrt{n}} + z_{1-\alpha/2}\right) - \varphi\left(\frac{\mu_0 - \mu}{\sigma_0/\sqrt{n}} - z_{1-\alpha/2}\right)\right]\frac{\sqrt{n}}{\sigma_0}.$$

This is positive when $\mu > \mu_0$, negative when $\mu < \mu_0$, and takes the value 0 when
$\mu = \mu_0$ (see Challenge 6.3.35). From (6.3.10) we have that $\beta(\mu) \to 1$ as $\mu \to \pm\infty$.
These facts establish that $\beta$ takes its minimum value at $\mu_0$ and that it is increasing as
we move away from $\mu_0$. Therefore, once we have determined $n$ so that the power is at
least $\beta_0$ at some $\mu_1$, we know that the power is at least $\beta_0$ for all values of $\mu$ satisfying
$|\mu_0 - \mu| \ge |\mu_0 - \mu_1|$.

As an example of this, consider Figure 6.3.5, where we have plotted the power
function when $n = 10$, $\mu_0 = 0$, $\sigma_0 = 1$, and $\alpha = 0.05$ so that

$$\beta(\mu) = 1 - \Phi(\sqrt{10}\,\mu + 1.96) + \Phi(\sqrt{10}\,\mu - 1.96).$$

Notice the symmetry about $\mu_0 = 0$ and the fact that $\beta(\mu)$ increases as $\mu$ moves away
from 0. We obtain $\beta(1.2) = 0.967$ so that when $\mu = 1.2$, the probability that the
P-value for testing $H_0 : \mu = 0$ will be less than 0.05 is 0.967. Of course, as we increase
$n$, this graph will rise even more steeply to 1 as we move away from 0.


[Figure 6.3.5: Plot of the power function $\beta(\mu)$ for Example 6.3.18 when $\alpha = 0.05$, $\mu_0 = 0$,
and $\sigma_0 = 1$ is assumed known. Vertical axis: power; horizontal axis: $\mu$, from $-5$ to $5$.]


Many statistical packages contain the power function as a built-in function for various
tests. This is very convenient for examining the sensitivity of the test and determining
sample sizes. ■
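Where a built-in power function is not available, (6.3.10) can be tabulated directly. The sketch below is an illustration under stated assumptions: `power_z` and `smallest_n` are hypothetical helper names, the exact quantile $z_{1-\alpha/2}$ is used instead of the rounded 1.96, and the sample size is found by a simple linear search that relies on the power being increasing in $n$.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_z(mu, mu0, sigma0, n, alpha=0.05):
    """Power of the two-sided z-test at mu, following (6.3.10)."""
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = (mu0 - mu) / (sigma0 / math.sqrt(n))
    return 1 - _STD_NORMAL.cdf(shift + z) + _STD_NORMAL.cdf(shift - z)


def smallest_n(mu1, mu0, sigma0, beta0, alpha=0.05):
    """Smallest n whose power at mu1 is at least beta0 (power increases with n)."""
    n = 1
    while power_z(mu1, mu0, sigma0, n, alpha) < beta0:
        n += 1
    return n


if __name__ == "__main__":
    print(power_z(1.2, 0.0, 1.0, 10))          # setting of Figure 6.3.5
    print(smallest_n(0.1, 0.0, 1.0, 0.99))     # target power 0.99
    print(smallest_n(0.1, 0.0, 1.0, 0.80))     # target power 0.80
```

With $\sigma_0 = 1$, $\alpha = 0.05$, and $\mu_1 = \mu_0 + 0.1$, this search returns 1838 for a target power of 0.99 and 785 for a target power of 0.80; at $\mu = \mu_0$ the function returns $\alpha$, as it should.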


EXAMPLE 6.3.19 The Power Function for 0 in the Bernoulli Model 
For the two-sided test in Example 6.3.11, we have that the power function is given by

$$\beta(\theta) = P_\theta\left(2\left[1 - \Phi\left(\left|\frac{\sqrt{n}\,(\bar{X} - \theta_0)}{\sqrt{\theta_0 (1 - \theta_0)}}\right|\right)\right] < \alpha\right).$$

Under the assumption that we choose $n$ large enough so that $\bar{X}$ is approximately
distributed $N(\theta, \theta(1 - \theta)/n)$, the approximate calculation of this power function can be
approached as in Example 6.3.18, when we put $\sigma_0 = \sqrt{\theta(1 - \theta)}$. We do not pursue
this calculation further here but note that many statistical packages will evaluate $\beta$ as a
built-in function. ■
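For moderate $n$, the power of this test can also be computed without any approximation to the distribution of $\bar{X}$, by summing binomial probabilities over the outcomes that yield a P-value below $\alpha$. This swaps in an exact binomial calculation for the normal approximation mentioned above; the function name `bernoulli_power` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def bernoulli_power(theta, theta0, n, alpha=0.05):
    """Exact probability, under Bernoulli(theta) sampling, that the
    approximate two-sided z-test of H0: theta = theta0 has P-value < alpha."""
    norm = NormalDist()
    power = 0.0
    for k in range(n + 1):
        z = math.sqrt(n) * (k / n - theta0) / math.sqrt(theta0 * (1 - theta0))
        if 2 * (1 - norm.cdf(abs(z))) < alpha:
            # k successes lead to rejection; add the binomial probability of k.
            power += math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)
    return power


if __name__ == "__main__":
    for theta in (0.5, 0.55, 0.6, 0.7):
        print(theta, bernoulli_power(theta, 0.5, 100))
```

Evaluating the function at $\theta = \theta_0$ gives the true probability of declaring significance when $H_0$ holds, which need not equal $\alpha$ exactly because the P-value is only approximate.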


EXAMPLE 6.3.20 The Power Function in the Location-Scale Normal Model 
For the two-sided t-test in Example 6.3.13, we have

$$\beta_n(\mu, \sigma^2) = P_{(\mu,\sigma^2)}\left(2\left[1 - G\left(\frac{|\bar{X} - \mu_0|}{S/\sqrt{n}}\,;\, n - 1\right)\right] < \alpha\right) = P_{(\mu,\sigma^2)}\left(\frac{|\bar{X} - \mu_0|}{S/\sqrt{n}} > t_{1-\alpha/2}(n - 1)\right),$$

where $G(\cdot\,; n - 1)$ is the cumulative distribution function of the $t(n - 1)$ distribution.
Notice that it is a function of both $\mu$ and $\sigma^2$. In particular, we have to specify both
$\mu$ and $\sigma^2$ and then determine $n$ so that $\beta_n(\mu, \sigma^2) \geq \beta_0$. Many statistical packages
will have the calculation of this power function built-in so that an appropriate $n$ can be
determined using this. Alternatively, we can use Monte Carlo methods to approximate
the distribution function of

$$\frac{\bar{X} - \mu_0}{S/\sqrt{n}}$$

when sampling from the $N(\mu, \sigma^2)$, for a variety of values of $n$, to determine an
appropriate value. ■
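
The Monte Carlo route just described can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names, the number of replications, and the seed are hypothetical choices, and the critical value $t_{1-\alpha/2}(n - 1)$ is itself approximated by simulating the t-statistic under $H_0$ (where its distribution does not depend on $\mu_0$ or $\sigma^2$) rather than taken from a table.

```python
import random
from math import sqrt


def t_statistic(sample, mu0):
    """(x_bar - mu0) / (s / sqrt(n)) for one sample."""
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return (mean - mu0) / sqrt(s2 / n)


def mc_t_power(mu, sigma, mu0, n, alpha=0.05, reps=10_000, seed=1):
    """Monte Carlo estimate of the power of the two-sided t-test at (mu, sigma^2)."""
    rng = random.Random(seed)
    # Step 1: approximate the critical value from the null distribution of |T|.
    null_stats = sorted(
        abs(t_statistic([rng.gauss(0.0, 1.0) for _ in range(n)], 0.0))
        for _ in range(reps)
    )
    critical = null_stats[int((1 - alpha) * reps)]
    # Step 2: estimate the rejection probability when sampling from N(mu, sigma^2).
    rejections = sum(
        abs(t_statistic([rng.gauss(mu, sigma) for _ in range(n)], mu0)) > critical
        for _ in range(reps)
    )
    return rejections / reps
```

Repeating `mc_t_power` over a grid of values of $n$ and taking the smallest $n$ whose estimated power exceeds $\beta_0$ gives an approximate sample size; the estimate at $\mu = \mu_0$ should be close to $\alpha$, which serves as a sanity check.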


Summary of Section 6.3 


• The MLE $\hat{\theta}$ is the value of the parameter $\theta$ best supported by the model and
data. As such, it makes sense to base the derivation of inferences about some
characteristic $\psi(\theta)$ on the MLE. These inferences include estimates and their
standard errors, confidence intervals, and the assessment of hypotheses via
P-values.


• An important aspect of the design of a sampling study is to decide on the size $n$
of the sample so that the study produces sufficiently accurate results.
Prescribing the half-lengths of confidence intervals (margins of error) or
the power of a test are two techniques for doing this.


EXERCISES 


6.3.1 Suppose measurements (in centimeters) are taken using an instrument. There
is error in the measuring process and a measurement is assumed to be distributed
$N(\mu, \sigma_0^2)$, where $\mu$ is the exact measurement and $\sigma_0^2 = 0.5$. If the ($n = 10$)
measurements 4.7, 5.5, 4.4, 3.3, 4.6, 5.3, 5.2, 4.8, 5.7, 5.3 were obtained, assess the hypothesis
$H_0 : \mu = 5$ by computing the relevant P-value. Also compute a 0.95-confidence
interval for the unknown $\mu$.




6.3.2 Suppose in Exercise 6.3.1, we drop the assumption that $\sigma_0^2 = 0.5$. Then assess
the hypothesis $H_0 : \mu = 5$ and compute a 0.95-confidence interval for $\mu$.

6.3.3 Marks on an exam in a statistics course are assumed to be normally distributed
with unknown mean but with variance equal to 5. A sample of four students is selected,
and their marks are 52, 63, 64, 84. Assess the hypothesis $H_0 : \mu = 60$ by computing
the relevant P-value and compute a 0.95-confidence interval for the unknown $\mu$.

6.3.4 Suppose in Exercise 6.3.3 that we drop the assumption that the population
variance is 5. Assess the hypothesis $H_0 : \mu = 60$ by computing the relevant P-value and
compute a 0.95-confidence interval for the unknown $\mu$.

6.3.5 Suppose that in Exercise 6.3.3 we had observed only one mark and that it was
52. Assess the hypothesis $H_0 : \mu = 60$ by computing the relevant P-value and compute
a 0.95-confidence interval for the unknown $\mu$. Is it possible to compute a P-value and
construct a 0.95-confidence interval for $\mu$ without the assumption that we know the
population variance? Explain your answer and, if your answer is no, determine the
minimum sample size $n$ for which inference is possible without the assumption that
the population variance is known.

6.3.6 Assume that the speed of light data in Table 6.3.1 is a sample from an $N(\mu, \sigma^2)$
distribution for some unknown values of $\mu$ and $\sigma^2$. Determine a 0.99-confidence
interval for $\mu$. Assess the null hypothesis $H_0 : \mu = 24$.

6.3.7 A manufacturer wants to assess whether or not rods are being constructed
appropriately, where the diameter of the rods is supposed to be 1.0 cm and the variation in the
diameters is known to be distributed $N(\mu, 0.1)$. The manufacturer is willing to tolerate
a deviation of the population mean from this value of no more than 0.1 cm, i.e., if the
population mean is within the interval $1.0 \pm 0.1$ cm, then the manufacturing process is
performing correctly. A sample of $n = 500$ rods is taken, and the average diameter
of these rods is found to be $\bar{x} = 1.05$ cm, with $s^2 = 0.083$ cm$^2$. Are these results
statistically significant? Are the results practically significant? Justify your answers.

6.3.8 A polling firm conducts a poll to determine what proportion $\theta$ of voters in a given
population will vote in an upcoming election. A random sample of $n = 250$ was taken
from the population, and the proportion answering yes was 0.62. Assess the hypothesis
$H_0 : \theta = 0.65$ and construct an approximate 0.90-confidence interval for $\theta$.

6.3.9 A coin was tossed n = 1000 times, and the proportion of heads observed was 
0.51. Do we have evidence to conclude that the coin is unfair? 

6.3.10 How many times must we toss a coin to ensure that a 0.95-confidence interval
for the probability of heads on a single toss has length less than 0.1, 0.05, and 0.01,
respectively?

6.3.11 Suppose a possibly biased die is rolled 30 times and that the face containing
two pips comes up 10 times. Do we have evidence to conclude that the die is biased?

6.3.12 Suppose a measurement on a population is assumed to be distributed $N(\mu, 2)$,
where $\mu \in R^1$ is unknown and that the size of the population is very large. A researcher
wants to determine a 0.95-confidence interval for $\mu$ that is no longer than 1. What is
the minimum sample size that will guarantee this?

6.3.13 Suppose $(x_1, \ldots, x_n)$ is a sample from a Bernoulli($\theta$) with $\theta \in [0, 1]$ unknown.
(a) Show that $\sum_{i=1}^{n} (x_i - \bar{x})^2 = n\bar{x}(1 - \bar{x})$. (Hint: $x_i^2 = x_i$.)




(b) If $X \sim$ Bernoulli($\theta$), then $\sigma^2 = \mathrm{Var}(X) = \theta(1 - \theta)$. Record the relationship
between the plug-in estimate of $\sigma^2$ and that given by $s^2$ in (5.5.5).

(c) Since $s^2$ is an unbiased estimator of $\sigma^2$ (see Problem 6.3.23), use the results in part
(b) to determine the bias in the plug-in estimate. What happens to this bias as $n \to \infty$?

6.3.14 Suppose you are told that, based on some data, a 0.95-confidence interval for
a characteristic $\psi(\theta)$ is given by (1.23, 2.45). You are then asked if there is any
evidence against the hypothesis $H_0 : \psi(\theta) = 2$. State your conclusion and justify your
reasoning.

6.3.15 Suppose that $x_1$ is a value from a Bernoulli($\theta$) with $\theta \in [0, 1]$ unknown.

(a) Is $x_1$ an unbiased estimator of $\theta$?

(b) Is $x_1^2$ an unbiased estimator of $\theta^2$?

6.3.16 Suppose a plug-in MLE of a characteristic $\psi(\theta)$ is given by 5.3. Also a P-value
was computed to assess the hypothesis $H_0 : \psi(\theta) = 5$ and the value was 0.000132. If
you are told that differences among values of $\psi(\theta)$ less than 0.5 are of no importance
as far as the application is concerned, then what do you conclude from these results?
Suppose instead you were told that differences among values of $\psi(\theta)$ less than 0.25
are of no importance as far as the application is concerned, then what do you conclude
from these results?

6.3.17 A P-value was computed to assess the hypothesis $H_0 : \psi(\theta) = 0$ and the value
0.22 was obtained. The investigator says this is strong evidence that the hypothesis is
correct. How do you respond?

6.3.18 A P-value was computed to assess the hypothesis $H_0 : \psi(\theta) = 1$ and the value
0.55 was obtained. You are told that differences in $\psi(\theta)$ greater than 0.5 are considered
to be practically significant but not otherwise. The investigator wants to know if enough
data were collected to reliably detect a difference of this size or greater. How would
you respond?


COMPUTER EXERCISES 


6.3.19 Suppose a measurement on a population can be assumed to follow the $N(\mu, \sigma^2)$
distribution, where $(\mu, \sigma^2) \in R^1 \times (0, \infty)$ is unknown and the size of the population is
very large. A very conservative upper bound on $\sigma$ is given by 5. A researcher wants to
determine a 0.95-confidence interval for $\mu$ that is no longer than 1. Determine a sample
size that will guarantee this. (Hint: Start with a large sample approximation.)

6.3.20 Suppose a measurement on a population is assumed to be distributed $N(\mu, 2)$,
where $\mu \in R^1$ is unknown and the size of the population is very large. A researcher
wants to assess a null hypothesis $H_0 : \mu = \mu_0$ and ensure that the probability is at
least 0.80 that the P-value is less than 0.05 when $\mu = \mu_0 \pm 0.5$. What is the minimum
sample size that will guarantee this? (Hint: Tabulate the power as a function of the
sample size $n$.)


6.3.21 Generate 10° samples of size n = 5 from the Bernoulli(0.5) distribution. For 
each of these samples, calculate (6.3.5) with $\gamma = 0.95$ and record the proportion of
intervals that contain the true value. What do you notice? Repeat this simulation with 
n = 20. What do you notice? 




6.3.22 Generate $10^4$ samples of size $n = 5$ from the $N(0, 1)$ distribution. For each of
these samples, calculate the interval $(\bar{x} - s/\sqrt{5},\ \bar{x} + s/\sqrt{5})$, where $s$ is the sample
standard deviation, and compute the proportion of times this interval contains $\mu$. Repeat
this simulation with $n = 10$ and 100 and compare your results.


PROBLEMS 


6.3.23 Suppose that $(x_1, \ldots, x_n)$ is a sample from a distribution with mean $\mu$ and
variance $\sigma^2$.

(a) Prove that $s^2$ given by (5.5.5) is an unbiased estimator of $\sigma^2$.

(b) If instead we estimate $\sigma^2$ by $(n - 1)s^2/n$, then determine the bias in this estimate
and what happens to it as $n \to \infty$.

6.3.24 Suppose we have two unbiased estimators $T_1$ and $T_2$ of $\psi(\theta) \in R^1$.

(a) Show that $\alpha T_1 + (1 - \alpha)T_2$ is also an unbiased estimator of $\psi(\theta)$ whenever
$\alpha \in [0, 1]$.

(b) If $T_1$ and $T_2$ are also independent, e.g., determined from independent samples, then
calculate $\mathrm{Var}_\theta(\alpha T_1 + (1 - \alpha)T_2)$ in terms of $\mathrm{Var}_\theta(T_1)$ and $\mathrm{Var}_\theta(T_2)$.

(c) For the situation in part (b), determine the best choice of $\alpha$ in the sense that for this
choice $\mathrm{Var}_\theta(\alpha T_1 + (1 - \alpha)T_2)$ is smallest. What is the effect on this combined estimator
of $T_1$ having a very large variance relative to $T_2$?

(d) Repeat parts (b) and (c), but now do not assume that $T_1$ and $T_2$ are independent, so
$\mathrm{Var}_\theta(\alpha T_1 + (1 - \alpha)T_2)$ will also involve $\mathrm{Cov}_\theta(T_1, T_2)$.

6.3.25 (One-sided confidence intervals for means) Suppose that $(x_1, \ldots, x_n)$ is a
sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in R^1$ is unknown and $\sigma_0^2$ is known.
Suppose we want to make inferences about the interval $\psi(\mu) = (-\infty, \mu)$. Consider the
problem of finding an interval $C(x_1, \ldots, x_n) = (-\infty, u(x_1, \ldots, x_n))$ that covers the
interval $(-\infty, \mu)$ with probability at least $\gamma$. So we want $u$ such that for every $\mu$,

$$P_\mu(\mu \leq u(X_1, \ldots, X_n)) \geq \gamma.$$

Note that $(-\infty, \mu) \subset (-\infty, u(x_1, \ldots, x_n))$ if and only if $\mu \leq u(x_1, \ldots, x_n)$, so
$C(x_1, \ldots, x_n)$ is called a left-sided $\gamma$-confidence interval for $\mu$. Obtain an exact
left-sided $\gamma$-confidence interval for $\mu$ using $u(x_1, \ldots, x_n) = \bar{x} + k(\sigma_0/\sqrt{n})$, i.e., find the
$k$ that gives this property.

6.3.26 (One-sided hypotheses for means) Suppose that $(x_1, \ldots, x_n)$ is a sample from
an $N(\mu, \sigma_0^2)$ distribution, where $\mu$ is unknown and $\sigma_0^2$ is known. Suppose we want
to assess the hypothesis $H_0 : \mu \leq \mu_0$. Under these circumstances, we say that the
observed value $\bar{x}$ is surprising if $\bar{x}$ occurs in a region of low probability for every
distribution in $H_0$. Therefore, a sensible P-value for this problem is
$\max_{\mu \in H_0} P_\mu(\bar{X} > \bar{x})$. Show that this leads to the P-value $1 - \Phi((\bar{x} - \mu_0)/(\sigma_0/\sqrt{n}))$.

6.3.27 Determine the form of the power function associated with the hypothesis
assessment procedure of Problem 6.3.26, when we declare a test result as being statistically
significant whenever the P-value is less than $\alpha$.

6.3.28 Repeat Problems 6.3.25 and 6.3.26, but this time obtain a right-sided $\gamma$-confidence
interval for $\mu$ and assess the hypothesis $H_0 : \mu \geq \mu_0$.




6.3.29 Repeat Problems 6.3.25 and 6.3.26, but this time do not assume the population
variance is known. In particular, determine $k$ so that $u(x_1, \ldots, x_n) = \bar{x} + k(s/\sqrt{n})$
gives an exact left-sided $\gamma$-confidence interval for $\mu$ and show that the P-value for
testing $H_0 : \mu \leq \mu_0$ is given by

$$1 - G\left(\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\,;\, n - 1\right).$$

6.3.30 (One-sided confidence intervals for variances) Suppose that $(x_1, \ldots, x_n)$ is a
sample from the $N(\mu, \sigma^2)$ distribution, where $(\mu, \sigma^2) \in R^1 \times (0, \infty)$ is unknown, and
we want a $\gamma$-confidence interval of the form

$$C(x_1, \ldots, x_n) = (0, u(x_1, \ldots, x_n))$$

for $\sigma^2$. If $u(x_1, \ldots, x_n) = ks^2$, then determine $k$ so that this interval is an exact
$\gamma$-confidence interval.

6.3.31 (One-sided hypotheses for variances) Suppose that $(x_1, \ldots, x_n)$ is a sample
from the $N(\mu, \sigma^2)$ distribution, where $(\mu, \sigma^2) \in R^1 \times (0, \infty)$ is unknown, and we
want to assess the hypothesis $H_0 : \sigma^2 \leq \sigma_0^2$. Argue that the sample variance $s^2$ is
surprising if $s^2$ is large and that, therefore, a sensible P-value for this problem is to
compute $\max_{(\mu, \sigma^2) \in H_0} P_{(\mu, \sigma^2)}(S^2 > s^2)$. Show that this leads to the P-value

$$1 - H\left(\frac{(n - 1)s^2}{\sigma_0^2}\,;\, n - 1\right),$$

where $H(\cdot\,; n - 1)$ is the distribution function of the $\chi^2(n - 1)$ distribution.

6.3.32 Determine the form of the power function associated with the hypothesis
assessment procedure of Problem 6.3.31, for computing the probability that the P-value
is less than $\alpha$.

6.3.33 Repeat Exercise 6.3.7, but this time do not assume that the population variance
is known. In this case, the manufacturer deems the process to be under control if the
population standard deviation is less than or equal to 0.1 and the population mean is in
the interval $1.0 \pm 0.1$ cm. Use Problem 6.3.31 for the test concerning the population
variance.


CHALLENGES 


6.3.34 Prove that (6.3.11) is always nonnegative. (Hint: Use the facts that $\varphi$ is
symmetric about 0, increases to the left of 0, and decreases to the right of 0.)


6.3.35 Establish that (6.3.13) is positive when $\mu > \mu_0$, negative when $\mu < \mu_0$, and
takes the value 0 when $\mu = \mu_0$.


DISCUSSION TOPICS 


6.3.36 Discuss the following statement: The accuracy of the results of a statistical 
analysis is so important that we should always take the largest possible sample size. 




6.3.37 Suppose we have a sequence of estimators $T_1, T_2, \ldots$ for $\psi(\theta)$ and
$T_n \xrightarrow{P} \psi(\theta) + \epsilon(\theta)$ as $n \to \infty$ for each $\theta \in \Omega$. Discuss under what circumstances you
might consider $T_n$ a useful estimator of $\psi(\theta)$.


6.4 | Distribution-Free Methods 


The likelihood methods we have been discussing all depend on the assumption that the
true distribution lies in $\{P_\theta : \theta \in \Omega\}$. There is typically nothing that guarantees that
the assumption $\{P_\theta : \theta \in \Omega\}$ is correct. If the distribution we are sampling from is far
different from any of the distributions in $\{P_\theta : \theta \in \Omega\}$, then methods of inference that
depend on this assumption, such as likelihood methods, can be very misleading. So
it is important in any application to check that our assumptions make sense. We will
discuss the topic of model checking in Chapter 9.

Another approach to this problem is to take the model $\{P_\theta : \theta \in \Omega\}$ as large as
possible, reflecting the fact that we may have very little information about what the
true distribution is like. For example, inferences based on the Bernoulli($\theta$) model with
$\theta \in \Omega = [0, 1]$ really specify no information about the true distribution because this
model includes all the possible distributions on the sample space $S = \{0, 1\}$. Inference
methods that are suitable when $\{P_\theta : \theta \in \Omega\}$ is very large are sometimes called
distribution-free, to reflect the fact that very little information is specified in the model
about the true distribution.

For finite sample spaces, it is straightforward to adopt the distribution-free
approach, as with the just cited Bernoulli model, but when the sample space is infinite,
things are more complicated. In fact, sometimes it is very difficult to determine
inferences about characteristics of interest when the model is very big. Furthermore, if we
have

$$\{P_\theta : \theta \in \Omega_1\} \subset \{P_\theta : \theta \in \Omega\},$$

then, when the smaller model contains the true distribution, methods based on the
smaller model will make better use of the information in the data about the true value
in $\Omega_1$ than will methods using the bigger model $\{P_\theta : \theta \in \Omega\}$. So there is a trade-off
between taking too big a model and taking too precise a model. This is an issue that a
statistician must always address.

We now consider some examples of distribution-free inferences. In some cases, the 
inferences have approximate sampling properties, while in other cases the inferences 
have exact sampling properties for very large models. 


6.4.1 | Method of Moments 


Suppose we take $\{P_\theta : \theta \in \Omega\}$ to be the set of all distributions on $R^1$ that have their
first $l$ moments, and we want to make inferences about the moments

$$\mu_i = E_\theta(X^i),$$




for $i = 1, \ldots, l$ based on a sample $(x_1, \ldots, x_n)$. The natural sample analog of the
population moment $\mu_i$ is the $i$th sample moment

$$m_i = \frac{1}{n} \sum_{j=1}^{n} x_j^i,$$

which would seem to be a sensible estimator.

In particular, we have that $E_\theta(M_i) = \mu_i$ for every $\theta \in \Omega$, so $m_i$ is unbiased,
and the weak and strong laws of large numbers establish that $m_i$ converges to $\mu_i$ as $n$
increases. Furthermore, the central limit theorem establishes that

$$\frac{M_i - \mu_i}{\sqrt{\mathrm{Var}_\theta(M_i)}} \xrightarrow{D} N(0, 1)$$

as $n \to \infty$, provided that $\mathrm{Var}_\theta(M_i) < \infty$. Now, because $X_1, \ldots, X_n$ are i.i.d., we
have that

$$\mathrm{Var}_\theta(M_i) = \frac{1}{n^2} \sum_{j=1}^{n} \mathrm{Var}_\theta(X_j^i) = \frac{1}{n} \mathrm{Var}_\theta(X_1^i) = \frac{1}{n} E_\theta\left((X_1^i - \mu_i)^2\right) = \frac{1}{n} E_\theta\left(X_1^{2i} - 2\mu_i X_1^i + \mu_i^2\right) = \frac{1}{n}\left(\mu_{2i} - \mu_i^2\right),$$

so we have that $\mathrm{Var}_\theta(M_i) < \infty$, provided that $i \leq l/2$. In this case, we can estimate
$\mu_{2i} - \mu_i^2$ by

$$s_i^2 = \frac{1}{n - 1} \sum_{j=1}^{n} \left(x_j^i - m_i\right)^2,$$

as we can simply treat $(x_1^i, \ldots, x_n^i)$ as a sample from a distribution with mean $\mu_i$
and variance $\mu_{2i} - \mu_i^2$. Problem 6.3.23 establishes that $s_i^2$ is an unbiased estimate of
$\mu_{2i} - \mu_i^2$, so $s_i^2/n$ is an unbiased estimate of $\mathrm{Var}_\theta(M_i)$. So, as with inferences for the
population mean based on the z-statistic, we have that

$$m_i \pm z_{(1+\gamma)/2} \frac{s_i}{\sqrt{n}}$$

is an approximate $\gamma$-confidence interval for $\mu_i$ whenever $i \leq l/2$ and $n$ is large. Also,
we can test the hypothesis $H_0 : \mu_i = \mu_{i0}$ in exactly the same fashion as we did for
the population mean using the z-statistic.

Notice that the model $\{P_\theta : \theta \in \Omega\}$ is very large (all distributions on $R^1$ having their
first $l$ moments finite), and these approximate inferences are appropriate for every
distribution in the model. A cautionary note is that estimation of moments becomes
more difficult as the order of the moments rises. Very large sample sizes are required
for the accurate estimation of high-order moments.
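
The interval for a population moment described above can be sketched in Python as follows. The function name `moment_interval` is a hypothetical choice for illustration; the sketch computes the $i$th sample moment, the sample variance of the $i$th powers, and the large-sample interval $m_i \pm z_{(1+\gamma)/2}\, s_i/\sqrt{n}$, and it is only appropriate when $n$ is large and the $2i$th moment is finite.

```python
from math import sqrt
from statistics import NormalDist


def moment_interval(xs, i, gamma=0.95):
    """Return (m_i, (lower, upper)): the i-th sample moment and an
    approximate gamma-confidence interval for the i-th population moment."""
    n = len(xs)
    powers = [x ** i for x in xs]                 # treat x_j^i as the sample
    m_i = sum(powers) / n                         # i-th sample moment
    s_i2 = sum((p - m_i) ** 2 for p in powers) / (n - 1)
    z = NormalDist().inv_cdf((1 + gamma) / 2)     # z_{(1 + gamma)/2}
    half_width = z * sqrt(s_i2 / n)
    return m_i, (m_i - half_width, m_i + half_width)
```

With `i = 1` this reduces to the usual large-sample z-interval for the mean. For higher `i` the half-width tends to grow quickly, which reflects the cautionary note about high-order moments.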

The general method of moments principle allows us to make inference about
characteristics that are functions of moments. This takes the following form:


Method of moments principle: A function $\psi(\mu_1, \ldots, \mu_k)$ of the first $k \leq l$
moments is estimated by $\psi(m_1, \ldots, m_k)$.




When $\psi$ is continuously differentiable and nonzero at $(\mu_1, \ldots, \mu_k)$, and $k \leq l/2$, then
it can be proved that $\psi(M_1, \ldots, M_k)$ is approximately normally distributed for large $n$,
with mean given by $\psi(\mu_1, \ldots, \mu_k)$ and variance given by an expression involving the variances
and covariances of $M_1, \ldots, M_k$ and the partial derivatives of $\psi$. We do not pursue this
topic further here but note that, in the case $k = 1$ and $l = 2$, these conditions lead to
the so-called delta theorem, which says that

$$\frac{\sqrt{n}\left(\psi(\bar{X}) - \psi(\mu_1)\right)}{|\psi'(\bar{X})|\, s} \xrightarrow{D} N(0, 1)$$

as $n \to \infty$, provided that $\psi$ is continuously differentiable at $\mu_1$ and $\psi'(\mu_1) \neq 0$; see
Approximation Theorems of Mathematical Statistics, by R. J. Serfling (John Wiley &
Sons, New York, 1980), for a proof of this result. This result provides approximate
confidence intervals and tests for $\psi(\mu_1)$.


EXAMPLE 6.4.1 Inference about a Characteristic Using the Method of Moments
Suppose $(x_1, \ldots, x_n)$ is a sample from a distribution with unknown mean $\mu$ and
variance $\sigma^2$, and we want to construct a $\gamma$-confidence interval for $\psi(\mu) = 1/\mu^2$. Then
$\psi'(\mu) = -2/\mu^3$, so the delta theorem says that

$$\frac{\sqrt{n}\left(1/\bar{X}^2 - 1/\mu^2\right)}{2s/|\bar{X}|^3} \xrightarrow{D} N(0, 1) \tag{6.4.1}$$

as $n \to \infty$. Therefore,

$$\frac{1}{\bar{x}^2} \pm \frac{2s}{\sqrt{n}\,|\bar{x}|^3}\, z_{(1+\gamma)/2}$$

is an approximate $\gamma$-confidence interval for $\psi(\mu) = 1/\mu^2$.

Notice that if $\mu = 0$, then this confidence interval is not valid because $\psi$ is not
continuously differentiable at 0. So if you think the population mean could be 0, or
even close to 0, this would not be an appropriate choice of confidence interval for $\psi$. ■
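
As a rough illustration of the delta-theorem interval in this example, the following Python sketch computes $1/\bar{x}^2 \pm z_{(1+\gamma)/2}\, 2s/(\sqrt{n}\,|\bar{x}|^3)$ for a given sample. The function name is a hypothetical choice, and the guard against $\bar{x} = 0$ only catches the degenerate case; it does not detect a population mean that is merely close to 0, where the interval is also unreliable.

```python
from math import sqrt
from statistics import NormalDist


def inverse_square_mean_interval(xs, gamma=0.95):
    """Approximate gamma-confidence interval for psi(mu) = 1 / mu^2
    based on the delta theorem; returns (lower, upper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    if x_bar == 0:
        raise ValueError("the interval is not defined when the sample mean is 0")
    s = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    estimate = 1 / x_bar ** 2                      # plug-in estimate of 1 / mu^2
    half_width = z * 2 * s / (sqrt(n) * abs(x_bar) ** 3)
    return estimate - half_width, estimate + half_width
```

The half-width is proportional to $1/|\bar{x}|^3$, so the interval widens sharply as the sample mean approaches 0, which is the numerical counterpart of the caution stated at the end of the example.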


6.4.2 | Bootstrapping 


Suppose that $\{P_\theta : \theta \in \Omega\}$ is the set of all distributions on $R^1$ and that $(x_1, \ldots, x_n)$ is
a sample from some unknown distribution with cdf $F_\theta$. Then the empirical distribution
function

$$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(x_i),$$

introduced in Section 5.4.1, is a natural estimator of the cdf $F_\theta(x)$.
We have

$$E_\theta(\hat{F}(x)) = \frac{1}{n} \sum_{i=1}^{n} E_\theta\left(I_{(-\infty, x]}(X_i)\right) = \frac{1}{n} \sum_{i=1}^{n} F_\theta(x) = F_\theta(x)$$

for every $\theta \in \Omega$ so that $\hat{F}$ is unbiased for $F_\theta$. The weak and strong laws of large
numbers then establish the consistency of $\hat{F}(x)$ for $F_\theta(x)$ as $n \to \infty$. Observing that




the $I_{(-\infty, x]}(x_i)$ constitute a sample from the Bernoulli($F_\theta(x)$) distribution, we have
that the standard error of $\hat{F}(x)$ is given by

$$\sqrt{\frac{\hat{F}(x)(1 - \hat{F}(x))}{n}}.$$

These facts can be used to form approximate confidence intervals and test hypotheses
for $F_\theta(x)$, just as in Examples 6.3.7 and 6.3.11.

Observe that $\hat{F}(x)$ prescribes a distribution on the set $\{x_1, \ldots, x_n\}$, e.g., if the
sample values are distinct, this probability distribution puts mass $1/n$ on each $x_i$. Note that
it is easy to sample a value from $\hat{F}$, as we just select a value from $\{x_1, \ldots, x_n\}$ where
each point has probability $1/n$ of occurring. When the $x_i$ are not distinct, then this is
changed in an obvious way, namely, $x_i$ has probability $f_i/n$, where $f_i$ is the number of
times $x_i$ occurs in $x_1, \ldots, x_n$.
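
The empirical distribution function, its standard error, and sampling from $\hat{F}$ can be sketched in Python as follows. The function names are hypothetical; drawing with replacement from the list of observed values automatically gives a repeated value $x_i$ probability $f_i/n$.

```python
import random
from math import sqrt


def ecdf(sample):
    """Return the empirical distribution function F_hat of the sample."""
    n = len(sample)

    def f_hat(x):
        # proportion of sample values less than or equal to x
        return sum(1 for xi in sample if xi <= x) / n

    return f_hat


def ecdf_standard_error(sample, x):
    """Standard error of F_hat(x), treating the indicators as Bernoulli."""
    p = ecdf(sample)(x)
    return sqrt(p * (1 - p) / len(sample))


def draw_from_ecdf(sample, size, rng):
    """Draw `size` values from F_hat: each observation has probability 1/n."""
    return rng.choices(sample, k=size)
```

For example, with the sample `[1, 2, 2, 5]`, the value 2 occurs twice, so it is drawn with probability $2/4$, and $\hat{F}(2) = 3/4$.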

Suppose we are interested in estimating $\psi(\theta) = T(F_\theta)$, where $T$ is a function
of the distribution $F_\theta$. We use this notation to emphasize that $\psi(\theta)$ corresponds to
some characteristic of the distribution rather than just being an arbitrary mathematical
function of $\theta$. For example, $T(F_\theta)$ could be a moment of $F_\theta$, a quantile of $F_\theta$, etc.

Now suppose we have an estimator $\hat{\psi}(x_1, \ldots, x_n)$ that is being proposed for
inferences about $\psi(\theta)$. Naturally, we are interested in the accuracy of $\hat{\psi}$, and we could
choose to measure this by

$$\mathrm{MSE}_\theta(\hat{\psi}) = \left(E_\theta(\hat{\psi}) - \psi(\theta)\right)^2 + \mathrm{Var}_\theta(\hat{\psi}). \tag{6.4.2}$$

Then, to assess the accuracy of our estimate $\hat{\psi}(x_1, \ldots, x_n)$, we need to estimate (6.4.2).

When $n$ is large, we expect $\hat{F}$ to be close to $F_\theta$, so a natural estimate of $\psi(\theta)$ is
$T(\hat{F})$, i.e., simply compute the same characteristic of the empirical distribution. This
is the approach adopted in Chapter 5 when we discussed descriptive statistics. Then
we estimate the square of the bias in $\hat{\psi}$ by

$$\left(\hat{\psi} - T(\hat{F})\right)^2. \tag{6.4.3}$$

To estimate the variance of $\hat{\psi}$, we use

$$\mathrm{Var}_{\hat{F}}(\hat{\psi}) = E_{\hat{F}}(\hat{\psi}^2) - E_{\hat{F}}^2(\hat{\psi}) = \frac{1}{n^n} \sum_{i_1=1}^{n} \cdots \sum_{i_n=1}^{n} \hat{\psi}^2(x_{i_1}, \ldots, x_{i_n}) - \left(\frac{1}{n^n} \sum_{i_1=1}^{n} \cdots \sum_{i_n=1}^{n} \hat{\psi}(x_{i_1}, \ldots, x_{i_n})\right)^2, \tag{6.4.4}$$

i.e., we treat $x_1, \ldots, x_n$ as i.i.d. random values with cdf given by $\hat{F}$. So to calculate
an estimate of (6.4.2), we simply have to calculate $\mathrm{Var}_{\hat{F}}(\hat{\psi})$. This is rarely feasible,
however, because the sums in (6.4.4) involve $n^n$ terms. For even very modest sample
sizes, like $n = 10$, this cannot be carried out, even on a computer.

The solution to this problem is to approximate (6.4.4) by drawing $m$ independent
samples of size $n$ from $\hat{F}$, evaluating $\hat{\psi}$ for each of these samples to obtain




$\hat{\psi}_1, \ldots, \hat{\psi}_m$, and then using the sample variance

$$\widehat{\mathrm{Var}}_{\hat{F}}(\hat{\psi}) = \frac{1}{m - 1} \sum_{i=1}^{m} \left(\hat{\psi}_i - \frac{1}{m} \sum_{j=1}^{m} \hat{\psi}_j\right)^2 \tag{6.4.5}$$

as the estimate. The $m$ samples from $\hat{F}$ are referred to as bootstrap samples or
resamples, and this technique is referred to as bootstrapping or resampling. Combining
(6.4.3) and (6.4.5) gives an estimate of $\mathrm{MSE}_\theta(\hat{\psi})$. Furthermore, $m^{-1} \sum_{i=1}^{m} \hat{\psi}_i$ is called
the bootstrap mean, and

$$\sqrt{\widehat{\mathrm{Var}}_{\hat{F}}(\hat{\psi})}$$

is the bootstrap standard error. Note that the bootstrap standard error is a valid estimate
of the error in $\hat{\psi}$ whenever $\hat{\psi}$ has little or no bias.
Consider the following example.


EXAMPLE 6.4.2 The Sample Median as an Estimator of the Population Mean 
Suppose we want to estimate the location of a unimodal, symmetric distribution. While 
the sample mean might seem like the obvious choice for this, it turns out that for some 
distributions there are better estimators. This is because the distribution we are sam- 
pling may have long tails, i.e., may produce extreme values that are far from the center 
of the distribution. This implies that the sample average itself could be highly influ- 
enced by a few extreme observations and would thus be a poor estimate of the true 
mean. 

Not all estimators suffer from this defect. For example, if we are sampling from a 
symmetric distribution, then either the sample mean or the sample median could serve 
as an estimator of the population mean. But, as we have previously discussed, the 
sample median is not influenced by extreme values, i.e., it does not change as we move 
the smallest (or largest) values away from the rest of the data, and this is not the case 
for the sample mean. 

A problem with working with the sample median $\hat{x}_{0.5}$, rather than the sample mean
$\bar{x}$, is that the sampling distribution for $\hat{x}_{0.5}$ is typically more difficult to study than
that of $\bar{x}$. In this situation, bootstrapping becomes useful. If we are estimating the
population mean $T(F_\theta)$ by using the sample median (which is appropriate when we
know the distribution we were sampling from is symmetric), then the estimate of the
squared bias in the sample median is given by

$$\left(\hat{\psi} - T(\hat{F})\right)^2 = \left(\hat{x}_{0.5} - \bar{x}\right)^2$$

because $\hat{\psi} = \hat{x}_{0.5}$ and $T(\hat{F}) = \bar{x}$ (the mean of the empirical distribution is $\bar{x}$). This
should be close to 0, or else our assumption of a symmetric distribution would seem
to be incorrect. To calculate (6.4.5), we have to generate $m$ samples of size $n$ from
$\{x_1, \ldots, x_n\}$ (with replacement) and calculate $\hat{x}_{0.5}$ for each sample.

To illustrate, suppose we have a sample of size $n = 15$, given by the following
table.

  -2.0   -0.2   -5.2   -3.5   -3.9
  -0.6   -4.3   -1.7   -9.5    1.6
  -2.9    0.9   -1.0   -2.0    3.0




Then, using the definition of $\hat{x}_{0.5}$ given by (5.5.4) (denoted X05 there), $\hat{\psi} = -2.000$
and $\bar{x} = -2.087$. The estimate of the squared bias (6.4.3) equals $(-2.000 + 2.087)^2 =
7.569 \times 10^{-3}$, which is appropriately small. Using a statistical package, we generated
$m = 10^3$ samples of size $n = 15$ from the distribution that has probability 1/15 at each
of the sample points and obtained

$$\widehat{\mathrm{Var}}_{\hat{F}}(\hat{\psi}) = 0.770866.$$

Based on $m = 10^4$ samples, we obtained

$$\widehat{\mathrm{Var}}_{\hat{F}}(\hat{\psi}) = 0.718612,$$

and based on $m = 10^5$ samples we obtained

$$\widehat{\mathrm{Var}}_{\hat{F}}(\hat{\psi}) = 0.704928.$$

Because these estimates appear to be stabilizing, we take the last value as our estimate. So in
this case, the bootstrap estimate of the MSE of the sample median at the true value of
$\theta$ is given by

$$\widehat{\mathrm{MSE}}_\theta(\hat{\psi}) = 0.007569 + 0.704928 = 0.71250.$$

Note that the estimated MSE of the sample average is given by $s^2/n = 0.62410$, so
the sample mean and sample median appear to be providing similar accuracy in this
problem. In Figure 6.4.1, we have plotted a density histogram of the sample medians
obtained from the $m = 10^5$ bootstrap samples. Note that the histogram is very skewed.
See Appendix B for more details on how these computations were carried out.


Figure 6.4.1: A density histogram of $m = 10^5$ sample medians, each obtained from a bootstrap
sample of size $n = 15$ from the data in Example 6.4.2.


Even with the very small sample size here, it was necessary to use the computer to 
carry out our calculations. To evaluate (6.4.4) exactly would have required computing 




the median of $15^{15}$ (roughly $4.4 \times 10^{17}$) samples, which is clearly impossible even
using a computer. So the bootstrap is a very useful device. ■
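
The bootstrap calculation in this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the computation from Appendix B: the function name `bootstrap_variance`, the seed, and the choice $m = 10^4$ are hypothetical, and `statistics.median` is assumed to agree with the median definition used in the text (for odd $n$ it returns the middle order statistic). Because resampling is random, the variance estimate will differ somewhat from the values reported above, and the squared bias computed from the unrounded mean is slightly smaller than the 0.007569 obtained from the rounded value $-2.087$.

```python
import random
from statistics import fmean, median, variance

# The n = 15 observations tabulated in Example 6.4.2.
DATA = [-2.0, -0.2, -5.2, -3.5, -3.9,
        -0.6, -4.3, -1.7, -9.5, 1.6,
        -2.9, 0.9, -1.0, -2.0, 3.0]


def bootstrap_variance(sample, estimator, m, rng):
    """Approximate (6.4.4) by the sample variance (6.4.5) of the estimator
    over m resamples of size n drawn with replacement."""
    n = len(sample)
    estimates = [estimator(rng.choices(sample, k=n)) for _ in range(m)]
    return variance(estimates)


rng = random.Random(2024)                       # arbitrary seed for reproducibility
psi_hat = median(DATA)                          # sample median
squared_bias = (psi_hat - fmean(DATA)) ** 2     # estimate (6.4.3)
var_hat = bootstrap_variance(DATA, median, 10_000, rng)
mse_hat = squared_bias + var_hat                # bootstrap estimate of the MSE
```

Rerunning with successively larger `m` and watching `var_hat` settle down mirrors the stabilization check described in the example.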


The validity of the bootstrapping technique depends on $\hat{\psi}$ having its first two
moments. So the family $\{P_\theta : \theta \in \Omega\}$ must be appropriately restricted, but we can see that
the technique is very general.

Broadly speaking, it is not clear how to choose $m$. Perhaps the most direct method
is to implement bootstrapping for successively higher values of $m$ and stop when we
see that the results stabilize for several values. This is what we did in Example 6.4.2,
but it must be acknowledged that this approach is not foolproof, as we could have a
sample $(x_1, \ldots, x_n)$ such that the estimate (6.4.5) is very slowly convergent.


Bootstrap Confidence Intervals 


Bootstrap methods have also been devised to obtain approximate $\gamma$-confidence
intervals for characteristics such as $\psi(\theta) = T(F_\theta)$. One very simple method is to
form the bootstrap t $\gamma$-confidence interval

$$\hat{\psi} \pm t_{(1+\gamma)/2}(n - 1) \sqrt{\widehat{\mathrm{Var}}_{\hat{F}}(\hat{\psi})},$$

where $t_{(1+\gamma)/2}(n - 1)$ is the $(1 + \gamma)/2$th quantile of the $t(n - 1)$ distribution. Another
possibility is to compute a bootstrap percentile confidence interval given by

$$\left(\hat{\psi}_{(1-\gamma)/2},\ \hat{\psi}_{(1+\gamma)/2}\right),$$

where $\hat{\psi}_p$ denotes the $p$th empirical quantile of $\hat{\psi}$ in the bootstrap sample of $m$.

It should be noted that to be applicable, these intervals require some conditions to
hold. In particular, $\hat{\psi}$ should be at least approximately unbiased for $\psi(\theta)$ and the
bootstrap distribution should be approximately normal. Looking at the plot of the bootstrap
distribution in Figure 6.4.1, we can see that the median does not have an approximately
normal bootstrap distribution, so these intervals are not applicable with the median.

Consider the following example. 


EXAMPLE 6.4.3 The 0.25-Trimmed Mean as an Estimator of the Population Mean 
One of the virtues of the sample median as an estimator of the population mean is 
that it is not affected by extreme values in the sample. On the other hand, the sample 
median discards all but one or two of the data values and so seems to be discarding 
a lot of information. Estimators known as trimmed means can be seen as an attempt 
at retaining the virtues of the median while at the same time not discarding too much 
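information. Let $\lfloor x \rfloor$ denote the greatest integer less than or equal to $x \in R^1$.


Definition 6.4.1 For $\alpha \in [0, 1]$, a sample $\alpha$-trimmed mean is given by

$$\frac{1}{n - 2\lfloor \alpha n \rfloor} \sum_{i=\lfloor \alpha n \rfloor + 1}^{n - \lfloor \alpha n \rfloor} x_{(i)},$$

where $x_{(i)}$ is the $i$th-order statistic.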




Thus for a sample $\alpha$-trimmed mean, we toss out (approximately) $\alpha n$ of the smallest
data values and $\alpha n$ of the largest data values and calculate the average of the $n - 2\alpha n$
of the data values remaining. We need the greatest integer function because in general,
$\alpha n$ will not be an integer. Note that the sample mean arises with $\alpha = 0$ and the sample
median arises with $\alpha = 0.5$.

For the data in Example 6.4.2 and $\alpha = 0.25$, we have $(0.25)15 = 3.75$, so we
discard the three smallest and three largest observations leaving the nine data
values $-3.9, -3.5, -2.9, -2.0, -2.0, -1.7, -1.0, -0.6, -0.2$. The average of these nine
values gives $\hat{\psi} = \bar{x}_{0.25} = -1.97778$, which we note is close to both the sample median
and the sample mean.
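
Definition 6.4.1 translates directly into a short Python function. The sketch below uses a hypothetical function name and applies it to the 15 observations tabulated in Example 6.4.2; the guard clause reflects the fact that for some combinations of $\alpha$ and $n$ no order statistics remain after trimming.

```python
from math import floor

# The n = 15 observations tabulated in Example 6.4.2.
DATA = [-2.0, -0.2, -5.2, -3.5, -3.9,
        -0.6, -4.3, -1.7, -9.5, 1.6,
        -2.9, 0.9, -1.0, -2.0, 3.0]


def trimmed_mean(sample, alpha):
    """Average of the order statistics x_(k+1), ..., x_(n-k), k = floor(alpha * n)."""
    n = len(sample)
    k = floor(alpha * n)
    kept = sorted(sample)[k:n - k]      # drop the k smallest and k largest values
    if not kept:
        raise ValueError("no observations remain after trimming")
    return sum(kept) / len(kept)
```

Applied to `DATA`, `trimmed_mean(DATA, 0.25)` drops three values from each end and averages the remaining nine, while `alpha = 0` gives the sample mean and `alpha = 0.5` gives the middle order statistic, which is the sample median for this odd sample size.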

Now suppose we use a 0.25-trimmed mean as an estimator $\hat{\psi}$ of a population mean
where we believe the population distribution is symmetric. Consider the data in
Example 6.4.2 and suppose we generated $m = 10^4$ bootstrap samples. We have plotted a
histogram of the $10^4$ values of $\hat{\psi}$ in Figure 6.4.2. Notice that it is very normal looking,
so we feel justified in using the confidence intervals associated with the bootstrap. In
this case, we obtained

$$\sqrt{\widehat{\mathrm{Var}}_{\hat{F}}(\hat{\psi})} = 0.7380,$$

so the bootstrap t 0.95-confidence interval for the mean is given by $-1.97778 \pm
(2.14479)(0.7380) \approx (-3.6, -0.4)$. Sorting the bootstrap sample gives a bootstrap
percentile 0.95-confidence interval as $(-3.36667, -0.488889) \approx (-3.4, -0.5)$, which
shows that the two intervals are very similar.
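
The two bootstrap intervals for the 0.25-trimmed mean can be sketched in Python as follows. The function names, the seed, and the particular empirical-quantile convention (the order statistic at position $\lceil pm \rceil$) are assumptions made for illustration; the t quantile 2.14479 is the value used in the text. Since resampling is random, the results will be close to, but not identical with, the intervals reported above.

```python
import random
from math import ceil, floor, sqrt
from statistics import variance

# The n = 15 observations tabulated in Example 6.4.2.
DATA = [-2.0, -0.2, -5.2, -3.5, -3.9,
        -0.6, -4.3, -1.7, -9.5, 1.6,
        -2.9, 0.9, -1.0, -2.0, 3.0]

T_QUANTILE = 2.14479  # t_{0.975}(14), the value used in the text


def trimmed_mean(sample, alpha):
    """Sample alpha-trimmed mean as in Definition 6.4.1."""
    n = len(sample)
    k = floor(alpha * n)
    kept = sorted(sample)[k:n - k]
    return sum(kept) / len(kept)


def bootstrap_estimates(sample, estimator, m, rng):
    """Evaluate the estimator on m resamples of size n drawn with replacement."""
    n = len(sample)
    return [estimator(rng.choices(sample, k=n)) for _ in range(m)]


def empirical_quantile(ordered, p):
    """p-th empirical quantile of an already sorted list (assumed convention)."""
    index = min(len(ordered) - 1, max(0, ceil(p * len(ordered)) - 1))
    return ordered[index]


rng = random.Random(7)  # arbitrary seed for reproducibility
psi_hat = trimmed_mean(DATA, 0.25)
boot = sorted(bootstrap_estimates(DATA, lambda s: trimmed_mean(s, 0.25), 10_000, rng))
std_err = sqrt(variance(boot))                       # bootstrap standard error
t_interval = (psi_hat - T_QUANTILE * std_err, psi_hat + T_QUANTILE * std_err)
percentile_interval = (empirical_quantile(boot, 0.025),
                       empirical_quantile(boot, 0.975))
```

Comparing `t_interval` with `percentile_interval` gives a quick check on the normality assumption: when the bootstrap distribution is roughly normal, as here, the two intervals should nearly coincide.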


Figure 6.4.2: A density histogram of $m = 10^4$ sample 0.25-trimmed means, each obtained
from a bootstrap sample of size $n = 15$ from the data in Example 6.4.3. ■

More details about the bootstrap can be found in An Introduction to the Bootstrap, 
by B. Efron and R. J. Tibshirani (Chapman and Hall, New York, 1993). 




6.4.3 | The Sign Statistic and Inferences about Quantiles 


Suppose that $\{P_\theta : \theta \in \Omega\}$ is the set of all distributions on $R^1$ such that the associated
distribution functions are continuous. Suppose we want to make inferences about a pth
quantile of $P_\theta$. We denote this quantile by $x_p(\theta)$ so that, when the distribution function
associated with $P_\theta$ is denoted by $F_\theta$, we have $p = F_\theta(x_p(\theta))$. Note that continuity
implies there is always a solution in $x$ to $p = F_\theta(x)$, and that $x_p(\theta)$ is the smallest
solution.

Recall the definitions and discussion of estimation of these quantities in Example
5.5.2 based on a sample $(x_1, \ldots, x_n)$. For simplicity, let us restrict attention to the
cases where $p = i/n$ for some $i \in \{1, \ldots, n\}$. In this case, we have that $\hat{x}_p = x_{(i)}$ is
the natural estimate of $x_p$.

Now consider assessing the evidence in the data concerning the hypothesis
$H_0 : x_p(\theta) = x_0$. For testing this hypothesis, we can use the sign test statistic, given by
$S = \sum_{i=1}^{n} I_{(-\infty, x_0]}(x_i)$. So S is the number of sample values less than or equal to $x_0$.

Notice that when $H_0$ is true, $I_{(-\infty, x_0]}(x_1), \ldots, I_{(-\infty, x_0]}(x_n)$ is a sample from the
Bernoulli(p) distribution. This implies that, when $H_0$ is true, $S \sim$ Binomial(n, p).

Therefore, we can test $H_0$ by computing the observed value of S, denoted $S_o$, and
seeing whether this value lies in a region of low probability for the Binomial(n, p)
distribution. Because the binomial distribution is unimodal, the regions of low probability
correspond to the left and right tails of this distribution. See, for example, Figure 6.4.3,
where we have plotted the probability function of a Binomial(20, 0.7) distribution.

The P-value is therefore obtained by computing the probability of the set

\[ \left\{ i : \binom{n}{i} p^{i} (1-p)^{n-i} \le \binom{n}{S_o} p^{S_o} (1-p)^{n-S_o} \right\} \tag{6.4.6} \]

using the Binomial(n, p) probability distribution. This is a measure of how far out in
the tails the observed value $S_o$ is (see Figure 6.4.3). Notice that this P-value is
completely independent of $\theta$ and is thus valid for the entire model. Tables of binomial
probabilities (Table D.6 in Appendix D), or built-in functions available in most
statistical packages, can be used to calculate this P-value.
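As a rough illustration of how the probability of the set (6.4.6) might be computed in software, the following Python sketch sums the Binomial(n, p) probabilities of all outcomes that are no more probable than the observed value. The function names and the small numerical tolerance (used to guard against rounding when two probabilities are equal in exact arithmetic) are our own choices, and the observed value in the usage lines is hypothetical.

```python
from math import comb

def binomial_pmf(i, n, p):
    """Probability that a Binomial(n, p) random variable equals i."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

def sign_statistic(sample, x0):
    """Number of sample values less than or equal to x0."""
    return sum(1 for x in sample if x <= x0)

def sign_test_p_value(s_obs, n, p):
    """Probability of the set (6.4.6): outcomes i whose Binomial(n, p)
    probability does not exceed that of the observed value s_obs."""
    threshold = binomial_pmf(s_obs, n, p) * (1 + 1e-12)  # tolerance for ties
    return sum(binomial_pmf(i, n, p)
               for i in range(n + 1)
               if binomial_pmf(i, n, p) <= threshold)

# Hypothetical use: n = 20 observations, testing a 0.7 quantile,
# with 10 of the observations at or below the hypothesized value.
print(sign_test_p_value(10, 20, 0.7))
```

Summing over the low-probability outcomes directly, rather than doubling one tail, matters when p is not 1/2, because the binomial distribution is then not symmetric.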


Figure 6.4.3: Plot of the Binomial(20, 0.7) probability function.


When n is large, we have that, under $H_0$,

\[ \frac{S - np}{\sqrt{np(1-p)}} \xrightarrow{D} N(0, 1) \]


358 Section 6.4: Distribution-Free Methods 


as $n \to \infty$. Therefore, an approximate P-value is given by

\[ 2\left\{ 1 - \Phi\left( \left| \frac{S_o - 0.5 - np}{\sqrt{np(1-p)}} \right| \right) \right\} \]

(as in Example 6.3.11), where we have replaced $S_o$ by $S_o - 0.5$ as a correction for
continuity (see Example 4.4.9 for discussion of the correction for continuity).

A special case arises when p = 1/2, i.e., when we are making inferences about
an unknown population median $x_{0.5}(\theta)$. In this case, the distribution of S under $H_0$ is
Binomial(n, 1/2). Because the Binomial(n, 1/2) is unimodal and symmetrical about
n/2, (6.4.6) becomes

\[ \{ i : |S_o - n/2| \le |i - n/2| \}. \]


If we want a $\gamma$-confidence interval for $x_{0.5}(\theta)$, then we can use the equivalence
between tests, which we always reject when the P-value is less than or equal to $1 - \gamma$,
and $\gamma$-confidence intervals (see Example 6.3.12). For this, let j be the smallest integer
greater than n/2 satisfying

\[ P(\{ i : |i - n/2| \ge j - n/2 \}) \le 1 - \gamma, \tag{6.4.7} \]

where P is the Binomial(n, 1/2) distribution. If $S \in \{ i : |i - n/2| \ge j - n/2 \}$, we
will reject $H_0 : x_{0.5}(\theta) = x_0$ at the $1 - \gamma$ level and will not otherwise. This leads
to the $\gamma$-confidence interval, namely, the set of all those values $x_{0.5}$ such that the null
hypothesis $H_0 : x_{0.5}(\theta) = x_{0.5}$ is not rejected at the $1 - \gamma$ level, equaling

\[
\begin{aligned}
C(x_1, \ldots, x_n) &= \left\{ x_0 : \left| \sum_{i=1}^{n} I_{(-\infty, x_0]}(x_i) - n/2 \right| < j - n/2 \right\} \\
&= \left\{ x_0 : n - j < \sum_{i=1}^{n} I_{(-\infty, x_0]}(x_i) < j \right\} = [x_{(n-j+1)}, x_{(j)})
\end{aligned}
\tag{6.4.8}
\]

because, for example, $n - j < \sum_{i=1}^{n} I_{(-\infty, x_0]}(x_i)$ if and only if $x_0 \ge x_{(n-j+1)}$.

EXAMPLE 6.4.4 Application of the Sign Test 


Suppose we have the following sample of size n = 10 from a continuous random
variable X, and we wish to test the hypothesis $H_0 : x_{0.5}(\theta) = 0$.

 0.44  -0.06   0.43  -0.16  -2.13
 1.15   1.08   5.67  -4.97   0.11


The boxplot in Figure 6.4.4 indicates that it is very unlikely that this sample came from 
a normal distribution, as there are two extreme observations. So it is appropriate to 
measure the location of the distribution of X by the median. 




Figure 6.4.4: Boxplot of the data in Example 6.4.4. 


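In this case, the sample median (using (5.5.4)) is given by (0.11 + 0.43)/2 = 0.27.
The sign statistic for the null is given by

\[ S = \sum_{i=1}^{10} I_{(-\infty, 0]}(x_i) = 4. \]

The P-value is given by

\[
\begin{aligned}
P(\{ i : |4 - 5| \le |i - 5| \}) &= P(\{ i : |i - 5| \ge 1 \}) = 1 - P(\{ i : |i - 5| < 1 \}) \\
&= 1 - P(\{5\}) = 1 - \binom{10}{5} \left(\frac{1}{2}\right)^{10} = 1 - 0.24609 = 0.75391,
\end{aligned}
\]

and we have no reason to reject the null hypothesis.
Now suppose that we want a 0.95-confidence interval for the median. Using software
(or Table D.6), we calculate

\[
\begin{aligned}
\binom{10}{5} \left(\tfrac{1}{2}\right)^{10} &= 0.24609, & \binom{10}{4} \left(\tfrac{1}{2}\right)^{10} &= 0.20508, \\
\binom{10}{3} \left(\tfrac{1}{2}\right)^{10} &= 0.11719, & \binom{10}{2} \left(\tfrac{1}{2}\right)^{10} &= 4.3945 \times 10^{-2}, \\
\binom{10}{1} \left(\tfrac{1}{2}\right)^{10} &= 9.7656 \times 10^{-3}, & \binom{10}{0} \left(\tfrac{1}{2}\right)^{10} &= 9.7656 \times 10^{-4}.
\end{aligned}
\]

We will use these values to compute the value of j in (6.4.7).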
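We can use the symmetry of the Binomial(10, 1/2) distribution about n/2 to compute
the values of $P(\{ i : |i - n/2| \ge j - n/2 \})$ as follows. For j = 10, we have that
(6.4.7) equals

\[ P(\{ i : |i - 5| \ge 5 \}) = P(\{0, 10\}) = 2 \binom{10}{0} \left(\frac{1}{2}\right)^{10} = 1.9531 \times 10^{-3}, \]

and note that $1.9531 \times 10^{-3} < 1 - 0.95 = 0.05$. For j = 9, we have that (6.4.7) equals

\[
\begin{aligned}
P(\{ i : |i - 5| \ge 4 \}) &= P(\{0, 1, 9, 10\}) = 2 \binom{10}{0} \left(\frac{1}{2}\right)^{10} + 2 \binom{10}{1} \left(\frac{1}{2}\right)^{10} \\
&= 2.1484 \times 10^{-2},
\end{aligned}
\]

which is also less than 0.05. For j = 8, we have that (6.4.7) equals

\[
\begin{aligned}
P(\{ i : |i - 5| \ge 3 \}) &= P(\{0, 1, 2, 8, 9, 10\}) \\
&= 2 \binom{10}{0} \left(\frac{1}{2}\right)^{10} + 2 \binom{10}{1} \left(\frac{1}{2}\right)^{10} + 2 \binom{10}{2} \left(\frac{1}{2}\right)^{10} \\
&= 0.10938,
\end{aligned}
\]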


and this is greater than 0.05. Therefore, the appropriate value is j = 9, and a 0.95-confidence
interval for the median is given by $[x_{(2)}, x_{(9)}) = [-2.13, 1.15)$, since the second smallest
observation is $-2.13$ and the ninth smallest is 1.15. ■
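The search for j in (6.4.7) and the resulting interval (6.4.8) can be automated. The Python sketch below is one possible way to do this, using the sample of Example 6.4.4; the function names are our own, and the tail probability is computed by doubling the upper tail, which relies on the symmetry of the Binomial(n, 1/2) distribution noted in the text.

```python
from math import comb

def tail_probability(j, n):
    """P({i : |i - n/2| >= j - n/2}) under Binomial(n, 1/2), for j > n/2.
    By symmetry this is twice the probability of the upper tail {j, ..., n}."""
    return 2 * sum(comb(n, i) for i in range(j, n + 1)) / 2**n

def median_confidence_interval(sample, gamma=0.95):
    """Return (j, (x_(n-j+1), x_(j))) with j the smallest integer greater than
    n/2 satisfying (6.4.7); the interval is closed on the left and open on the
    right, as in (6.4.8). Returns None if no such j exists (n too small)."""
    n = len(sample)
    xs = sorted(sample)
    for j in range(n // 2 + 1, n + 1):
        if tail_probability(j, n) <= 1 - gamma:
            return j, (xs[n - j], xs[j - 1])   # 0-based indices of x_(n-j+1), x_(j)
    return None

sample = [0.44, -0.06, 0.43, -0.16, -2.13, 1.15, 1.08, 5.67, -4.97, 0.11]
j, interval = median_confidence_interval(sample)
print("j =", j)
print("interval endpoints (left closed, right open):", interval)
print("tail probabilities for j = 8, 9, 10:",
      [round(tail_probability(k, 10), 5) for k in (8, 9, 10)])
```

Since the tail probability decreases as j increases, the first j that satisfies the inequality is the smallest one, so the loop can stop as soon as the condition holds.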


There are many other distribution-free methods for a variety of statistical situations. 
While some of these are discussed in the problems, we leave a thorough study of such 
methods to further courses in statistics. 


Summary of Section 6.4 


• Distribution-free methods of statistical inference are appropriate methods when
we feel we can make only very minimal assumptions about the distribution from
which we are sampling.


• The method of moments, bootstrapping, and methods of inference based on the
sign statistic are three distribution-free methods that are applicable in different
circumstances.


EXERCISES 


6.4.1 Suppose we obtained the following sample from a distribution that we know has
its first six moments. Determine an approximate 0.95-confidence interval for $\mu_3$.

 3.27  -1.24   3.97   2.25   3.47  -0.09   7.45   6.20   3.74   4.12
 1.42   2.75  -1.48   4.97   8.00   3.26   0.15  -3.64   4.88   4.55


6.4.2 Determine the method of moments estimator of the population variance. Is this 
estimator unbiased for the population variance? Justify your answer. 

6.4.3 (Coefficient of variation) The coefficient of variation for a population measurement
with nonzero mean is given by $\sigma/\mu$, where $\mu$ is the population mean and $\sigma$ is the
population standard deviation. What is the method of moments estimate of the coefficient
of variation? Prove that the coefficient of variation is invariant under rescalings of
the distribution, i.e., under transformations of the form $T(x) = cx$ for constant $c > 0$.
It is this invariance that leads to the coefficient of variation being an appropriate measure
of sampling variability in certain problems, as it is independent of the units we use
for the measurement.

6.4.4 For the context described in Exercise 6.4.1, determine an approximate 0.95-
confidence interval for $\exp(\mu_1)$.

6.4.5 Verify that the third moment of an $N(\mu, \sigma^2)$ distribution is given by
$\mu_3 = \mu^3 + 3\mu\sigma^2$. Because the normal distribution is specified by its first two moments,
any characteristic of the normal distribution can be estimated by simply plugging in
the MLE estimates of $\mu$ and $\sigma^2$. Compare the method of moments estimator of $\mu_3$
with this plug-in MLE estimator, i.e., determine whether they are the same or not.
6.4.6 Suppose we have the sample data 1.48, 4.10, 2.02, 56.59, 2.98, 1.51, 76.49, 
50.25, 43.52, 2.96. Consider this as a sample from a normal distribution with unknown 
mean and variance, and assess the hypothesis that the population median (which is 
the same as the mean in this case) is 3. Also carry out a sign test that the population 
median is 3 and compare the results. Plot a boxplot for these data. Does this support 
the assumption that we are sampling from a normal distribution? Which test do you 
think is more appropriate? Justify your answer. 

6.4.7 Determine the empirical distribution function based on the sample given below. 


0.40 1.36 —0.35 
—0.58 —0.24 —1.34 


—1.35 2.05 1.06 
2.13 —0.03 —1.29 


Using the empirical cdf, determine the sample median, the first and third quartiles, and 
the interquartile range. What is your estimate of F (2)? 

6.4.8 Suppose you obtain the sample of n = 3 distinct values given by 1, 2, and 3. 

(a) Write down all possible bootstrap samples. 

(b) If you are bootstrapping the sample median, what are the possible values for the 
sample median for a bootstrap sample? 

(c) If you are bootstrapping the sample mean, what are the possible values for the 
sample mean for a bootstrap sample? 

(d) What do you conclude about the bootstrap distribution of the sample median com- 
pared to the bootstrap distribution of the sample mean? 

6.4.9 Explain why the central limit theorem justifies saying that the bootstrap distri- 
bution of the sample mean is approximately normal when n and m are large. What 
result justifies the approximate normality of the bootstrap distribution of a function of 
the sample mean under certain conditions? 

6.4.10 For the data in Exercise 6.4.1, determine an approximate 0.95-confidence inter- 
val for the population median when we assume the distribution we are sampling from 
is symmetric with finite first and second moments. (Hint: Use large sample results.) 
6.4.11 Suppose you have a sample of n distinct values and are interested in the bootstrap
distribution of the sample range given by $x_{(n)} - x_{(1)}$. What is the maximum
number of values that this statistic can take over all bootstrap samples? What are the
largest and smallest values that the sample range can take in a bootstrap sample? Do 
you think the bootstrap distribution of the sample range will be approximately normal? 
Justify your answer. 

6.4.12 Suppose you obtain the data 1.1, —1.0, 1.1, 3.1, 2.2, and 3.1. How many dis- 
tinct bootstrap samples are there? 




COMPUTER EXERCISES 


6.4.13 For the data of Exercise 6.4.7, assess the hypothesis that the population median 
is 0. State a 0.95-confidence interval for the population median. What is the exact 
coverage probability of this interval? 

6.4.14 For the data of Exercise 6.4.7, assess the hypothesis that the first quartile of the 
distribution we are sampling from is —1.0. 

6.4.15 With a bootstrap sample size of m = 1000, use bootstrapping to estimate the
MSE of the plug-in MLE estimator of $\mu_3$ for the normal distribution, using the sample
data in Exercise 6.4.1. Determine whether m = 1000 is a large enough sample for
accurate results.

6.4.16 For the data of Exercise 6.4.1, use the plug-in MLE to estimate the first quartile
of an $N(\mu, \sigma^2)$ distribution. Use bootstrapping to estimate the MSE of this estimate
for $m = 10^3$ and $m = 10^4$ (use (5.5.3) to compute the first quartile of the empirical
distribution).

6.4.17 For the data of Exercise 6.4.1, use the plug-in MLE to estimate F(3) for an
$N(\mu, \sigma^2)$ distribution. Use bootstrapping to estimate the MSE of this estimate for
$m = 10^3$ and $m = 10^4$.

6.4.18 For the data of Exercise 6.4.1, form a 0.95-confidence interval for $\mu$ assuming
that this is a sample from an $N(\mu, \sigma^2)$ distribution. Also compute a 0.95-confidence
interval for $\mu$ based on the sign statistic, a bootstrap t 0.95-confidence interval, and
a bootstrap percentile 0.95-confidence interval using $m = 10^3$ for the bootstrapping.
Compare the four intervals.

6.4.19 For the data of Exercise 6.4.1, use the plug-in MLE to estimate the first quintile,
i.e., $x_{0.2}$, of an $N(\mu, \sigma^2)$ distribution. Plot a density histogram estimate of the bootstrap
distribution of this estimator for $m = 10^3$ and compute a bootstrap t 0.95-confidence
interval for $x_{0.2}$, if you think it is appropriate.

6.4.20 For the data of Exercise 6.4.1, use the plug-in MLE to estimate $\mu_3$ of an
$N(\mu, \sigma^2)$ distribution. Plot a density histogram estimate of the bootstrap distribution
of this estimator for $m = 10^3$ and compute a bootstrap percentile 0.95-confidence
interval for $\mu_3$, if you think it is appropriate.


PROBLEMS 


6.4.21 Prove that when $(x_1, \ldots, x_n)$ is a sample of distinct values from a distribution
on $R^1$, then the ith moment of the empirical distribution on $R^1$ (i.e., the distribution
with cdf given by $\hat{F}$) is $m_i$.

6.4.22 Suppose that $(x_1, \ldots, x_n)$ is a sample from a distribution on $R^1$. Determine the
general form of the ith moment of $\hat{F}$, i.e., in contrast to Problem 6.4.21, we are now
allowing for several of the data values to be equal.

6.4.23 (Variance stabilizing transformations) From the delta theorem, we have that
$\psi(M_1)$ is asymptotically normal with mean $\psi(\mu_1)$ and variance $(\psi'(\mu_1))^2 \sigma^2 / n$ when
$\psi$ is continuously differentiable, $\psi'(\mu_1) \ne 0$, and $M_1$ is asymptotically normal with
mean $\mu_1$ and variance $\sigma^2/n$. In some applications, it is important to choose the
transformation $\psi$ so that the asymptotic variance does not depend on the mean $\mu_1$, i.e.,
$(\psi'(\mu_1))^2 \sigma^2$ is constant as $\mu_1$ varies (note that $\sigma^2$ may change as $\mu_1$ changes). Such
transformations are known as variance stabilizing transformations.

(a) If we are sampling from a Poisson($\lambda$) distribution, then show that $\psi(x) = \sqrt{x}$ is
variance stabilizing.

(b) If we are sampling from a Bernoulli($\theta$) distribution, show that $\psi(x) = \arcsin\sqrt{x}$
is variance stabilizing.

(c) If we are sampling from a distribution on $(0, \infty)$ whose variance is proportional
to the square of its mean (like the Gamma($\alpha$, $\beta$) distribution), then show that $\psi(x) = \ln(x)$
is variance stabilizing.


CHALLENGES 


6.4.24 Suppose that X has an absolutely continuous distribution on $R^1$ with density f
that is symmetrical about its median. Assuming that the median is 0, prove that |X|
and

\[
\operatorname{sgn}(X) =
\begin{cases}
-1 & X < 0 \\
0 & X = 0 \\
1 & X > 0
\end{cases}
\]

are independent, with |X| having density 2f and sgn(X) uniformly distributed on
{−1, 1}.

6.4.25 (Fisher signed deviation statistic) Suppose that $(x_1, \ldots, x_n)$ is a sample from
an absolutely continuous distribution on $R^1$ with density that is symmetrical about its
median. Suppose we want to assess the hypothesis $H_0 : x_{0.5}(\theta) = x_0$.

One possibility for this is to use the Fisher signed deviation test based on the
statistic $S^+$. The observed value of $S^+$ is given by $S_o^+ = \sum_{i=1}^{n} |x_i - x_0| \operatorname{sgn}(x_i - x_0)$.
We then assess $H_0$ by comparing $S_o^+$ with the conditional distribution of $S^+$ given the
absolute deviations $|x_1 - x_0|, \ldots, |x_n - x_0|$. If a value $S_o^+$ occurs near the smallest or
largest possible value for $S^+$, under this conditional distribution, then we assert that
we have evidence against $H_0$. We measure this by computing the P-value given by the
conditional probability of obtaining a value as far, or farther, from the center of the
conditional distribution of $S^+$ using the conditional mean as the center. This is an
example of a randomization test, as the distribution for the test statistic is determined by
randomly modifying the observed data (in this case, by randomly changing the signs
of the deviations of the $x_i$ from $x_0$).

(a) Prove that $S_o^+ = n(\bar{x} - x_0)$.

(b) Prove that the P-value described above does not depend on which distribution we
are sampling from in the model. Prove that the conditional mean of $S^+$ is 0 and the
conditional distribution of $S^+$ is symmetric about this value.

(c) Use the Fisher signed deviation test statistic to assess the hypothesis $H_0 : x_{0.5}(\theta) = 2$
when the data are 2.2, 1.5, 3.4, 0.4, 5.3, 4.3, 2.1, with the results declared to be
statistically significant if the P-value is less than or equal to 0.05. (Hint: Based on the
results obtained in part (b), you need only compute probabilities for the extreme values
of $S^+$.)


364 Section 6.5: Large Sample Behavior of the MLE (Advanced) 


(d) Show that using the Fisher signed deviation test statistic to assess the hypothesis
$H_0 : x_{0.5}(\theta) = x_0$ is equivalent to the following randomized t-test statistic hypothesis
assessment procedure. For this, we compute the conditional distribution of

\[ T = \frac{\bar{X} - x_0}{S/\sqrt{n}} \]

when the $|X_i - x_0| = |x_i - x_0|$ are fixed and the $\operatorname{sgn}(X_i - x_0)$ are i.i.d. uniform on
{−1, 1}. Compare the observed value of the t-statistic with this distribution, as we
did for the Fisher signed deviation test statistic. (Hint: Show that
$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} (x_i - x_0)^2 - n(\bar{x} - x_0)^2$ and that large absolute values of T correspond to large
absolute values of $S^+$.)


6.5 Asymptotics for the MLE (Advanced)


As we saw in Examples 6.3.7 and 6.3.11, implementing exact sampling procedures 
based on the MLE can be difficult. In those examples, because the MLE was the sample 
average and we could use the central limit theorem, large sample theory allowed us to 
work out approximate procedures. In fact, there is some general large sample theory 
available for the MLE that allows us to obtain approximate sampling inferences. This 
is the content of this section. The results we develop are all for the case when $\theta$ is
one-dimensional. Similar results exist for the higher-dimensional problems, but we leave
those to a later course.

In Section 6.3, the basic issue was the need to measure the accuracy of the MLE. 
One approach is to plot the likelihood and examine how concentrated it is about its 
peak, with a more highly concentrated likelihood implying greater accuracy for the 
MLE. There are several problems with this. In particular, the appearance of the likeli- 
hood will depend greatly on how we choose the scales for the axes. With appropriate 
choices, we can make a likelihood look as concentrated or as diffuse as we want. Also, 
when $\theta$ is more than two-dimensional, we cannot even plot the likelihood. One solution,
when the likelihood is a smooth function of $\theta$, is to compute a numerical measure
of how concentrated the log-likelihood is at its peak. The quantity typically used for
this is called the observed Fisher information.


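Definition 6.5.1 The observed Fisher information is given by

\[ \hat{I}(s) = -\left. \frac{\partial^2 l(\theta \mid s)}{\partial \theta^2} \right|_{\theta = \hat\theta(s)}, \tag{6.5.1} \]

where $\hat\theta(s)$ is the MLE.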


The larger the observed Fisher information is, the more peaked the likelihood func- 
tion is at its maximum value. We will show that the observed Fisher information is 
estimating a quantity of considerable importance in statistical inference. 




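Suppose that response X is real-valued, $\theta$ is real-valued, and the model $\{f_\theta : \theta \in \Omega\}$
satisfies the following regularity conditions:

\[ \frac{\partial^2 \ln f_\theta(x)}{\partial \theta^2} \text{ exists for each } x, \tag{6.5.2} \]

\[ E_\theta(S(\theta \mid X)) = \int_{-\infty}^{\infty} \frac{\partial \ln f_\theta(x)}{\partial \theta} f_\theta(x) \, dx = 0, \tag{6.5.3} \]

\[ \int_{-\infty}^{\infty} \frac{\partial}{\partial \theta} \left[ \frac{\partial \ln f_\theta(x)}{\partial \theta} f_\theta(x) \right] dx = 0, \tag{6.5.4} \]

and

\[ \int_{-\infty}^{\infty} \left| \frac{\partial^2 \ln f_\theta(x)}{\partial \theta^2} \right| f_\theta(x) \, dx < \infty. \tag{6.5.5} \]

Note that we have

\[ \frac{\partial f_\theta(x)}{\partial \theta} = \frac{\partial \ln f_\theta(x)}{\partial \theta} f_\theta(x) = S(\theta \mid x) f_\theta(x), \]

so we can write (6.5.3) equivalently as

\[ \int_{-\infty}^{\infty} \frac{\partial f_\theta(x)}{\partial \theta} \, dx = 0. \]

Also note that (6.5.4) can be written as

\[
\begin{aligned}
0 &= \int_{-\infty}^{\infty} \frac{\partial}{\partial \theta} \left[ \frac{\partial l(\theta \mid x)}{\partial \theta} f_\theta(x) \right] dx \\
&= \int_{-\infty}^{\infty} \left\{ \frac{\partial^2 l(\theta \mid x)}{\partial \theta^2} + \left( \frac{\partial l(\theta \mid x)}{\partial \theta} \right)^2 \right\} f_\theta(x) \, dx \\
&= \int_{-\infty}^{\infty} \left\{ \frac{\partial^2 l(\theta \mid x)}{\partial \theta^2} + S^2(\theta \mid x) \right\} f_\theta(x) \, dx
= E_\theta \left( \frac{\partial^2 l(\theta \mid X)}{\partial \theta^2} + S^2(\theta \mid X) \right).
\end{aligned}
\]

This, together with (6.5.3) and (6.5.5), implies that we can write (6.5.4) equivalently as

\[ \mathrm{Var}_\theta(S(\theta \mid X)) = E_\theta(S^2(\theta \mid X)) = E_\theta \left( -\frac{\partial^2}{\partial \theta^2} l(\theta \mid X) \right). \]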


We give a name to the quantity on the left. 

Definition 6.5.2 The function $I(\theta) = \mathrm{Var}_\theta(S(\theta \mid X))$ is called the Fisher
information of the model.


Our developments above have proven the following result. 


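Theorem 6.5.1 If (6.5.2) and (6.5.3) are satisfied, then $E_\theta(S(\theta \mid X)) = 0$. If, in
addition, (6.5.4) and (6.5.5) are satisfied, then

\[ I(\theta) = \mathrm{Var}_\theta(S(\theta \mid X)) = E_\theta \left( -\frac{\partial^2 l(\theta \mid X)}{\partial \theta^2} \right). \]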




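Now we see why $\hat{I}$ is called the observed Fisher information, as it is a natural estimate
of the Fisher information at the true value $\theta$. We note that there is another natural
estimate of the Fisher information at the true value, given by $I(\hat\theta)$. We call this the
plug-in Fisher information.

When we have a sample $(x_1, \ldots, x_n)$ from $f_\theta$, then

\[ S(\theta \mid x_1, \ldots, x_n) = \frac{\partial}{\partial \theta} \ln \prod_{i=1}^{n} f_\theta(x_i) = \sum_{i=1}^{n} \frac{\partial \ln f_\theta(x_i)}{\partial \theta} = \sum_{i=1}^{n} S(\theta \mid x_i). \]

So, if (6.5.3) holds for the basic model, then $E_\theta(S(\theta \mid X_1, \ldots, X_n)) = 0$ and (6.5.3)
also holds for the sampling model. Furthermore, if (6.5.4) holds for the basic model,
then

\[
\begin{aligned}
0 &= \sum_{i=1}^{n} E_\theta \left( \frac{\partial^2}{\partial \theta^2} \ln f_\theta(X_i) \right) + \sum_{i=1}^{n} E_\theta(S^2(\theta \mid X_i)) \\
&= E_\theta \left( \frac{\partial^2}{\partial \theta^2} l(\theta \mid X_1, \ldots, X_n) \right) + \mathrm{Var}_\theta(S(\theta \mid X_1, \ldots, X_n)),
\end{aligned}
\]

which implies

\[ \mathrm{Var}_\theta(S(\theta \mid X_1, \ldots, X_n)) = -E_\theta \left( \frac{\partial^2}{\partial \theta^2} l(\theta \mid X_1, \ldots, X_n) \right) = nI(\theta), \]

because $l(\theta \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} \ln f_\theta(x_i)$. Therefore, (6.5.4) holds for the sampling
model as well, and the Fisher information for the sampling model is given by the
sample size times the Fisher information for the basic model. We have established the
following result.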


Corollary 6.5.1 Under i.i.d. sampling from a model with Fisher information $I(\theta)$,
the Fisher information for a sample of size n is given by $nI(\theta)$.


The conditions necessary for Theorem 6.5.1 to apply do not hold in general and 
have to be checked in each example. There are, however, many models where these 
conditions do hold. 
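For a model where the conditions do hold, the two expressions for the Fisher information in Theorem 6.5.1 can be checked numerically. The following Python sketch does this for a single Bernoulli($\theta$) observation, where the integrals in (6.5.3) to (6.5.5) become sums over the two possible values of x. The choice of model, the value $\theta = 0.3$, and the function names are our own illustrative assumptions.

```python
def bernoulli_pmf(x, theta):
    """Probability function of the Bernoulli(theta) distribution."""
    return theta if x == 1 else 1 - theta

def score(theta, x):
    """Derivative in theta of the log-likelihood x*ln(theta) + (1 - x)*ln(1 - theta)."""
    return x / theta - (1 - x) / (1 - theta)

def second_derivative(theta, x):
    """Second derivative in theta of the same log-likelihood."""
    return -x / theta**2 - (1 - x) / (1 - theta)**2

def expectation(g, theta):
    """E_theta(g(theta, X)) for X ~ Bernoulli(theta), as a sum over x in {0, 1}."""
    return sum(g(theta, x) * bernoulli_pmf(x, theta) for x in (0, 1))

theta = 0.3
mean_score = expectation(score, theta)
var_score = expectation(lambda t, x: score(t, x) ** 2, theta) - mean_score**2
neg_mean_second = -expectation(second_derivative, theta)

print("E(S)           =", mean_score)                  # zero up to rounding
print("Var(S)         =", var_score)
print("-E(d2l/dtheta2) =", neg_mean_second)
print("1/(theta(1-theta)) =", 1 / (theta * (1 - theta)))

# By Corollary 6.5.1, the information in a sample of size n is n times this value.
n = 50
print("n * I(theta) for n = 50:", n * var_score)
```

The variance of the score and the negative expected second derivative agree, up to rounding, with the closed form $1/(\theta(1-\theta))$ for this model.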


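EXAMPLE 6.5.1 Nonexistence of the Fisher Information
If $X \sim U[0, \theta]$, then $f_\theta(x) = \theta^{-1} I_{[0,\theta]}(x)$, which is not differentiable at $\theta = x$ for
any x. Indeed, if we ignored the lack of differentiability at $\theta = x$ and wrote

\[ \frac{\partial f_\theta(x)}{\partial \theta} = -\frac{1}{\theta^2} I_{[0,\theta]}(x), \]

then

\[ \int_{-\infty}^{\infty} \frac{\partial f_\theta(x)}{\partial \theta} \, dx = -\int_{-\infty}^{\infty} \frac{1}{\theta^2} I_{[0,\theta]}(x) \, dx = -\frac{1}{\theta} \ne 0. \]

So we cannot define the Fisher information for this model. ■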




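EXAMPLE 6.5.2 Location Normal
Suppose we have a sample $(x_1, \ldots, x_n)$ from an $N(\theta, \sigma_0^2)$ distribution where $\theta \in R^1$
is unknown and $\sigma_0^2$ is known. We saw in Example 6.2.2 that

\[ S(\theta \mid x_1, \ldots, x_n) = \frac{n}{\sigma_0^2} (\bar{x} - \theta) \]

and therefore

\[ \frac{\partial^2}{\partial \theta^2} l(\theta \mid x_1, \ldots, x_n) = -\frac{n}{\sigma_0^2}, \]

\[ nI(\theta) = E_\theta \left( -\frac{\partial^2}{\partial \theta^2} l(\theta \mid X_1, \ldots, X_n) \right) = \frac{n}{\sigma_0^2}. \]

We also determined in Example 6.2.2 that the MLE is given by $\hat\theta(x_1, \ldots, x_n) = \bar{x}$.
Then the plug-in Fisher information is

\[ nI(\bar{x}) = \frac{n}{\sigma_0^2}, \]

while the observed Fisher information is

\[ \hat{I}(x_1, \ldots, x_n) = -\left. \frac{\partial^2 l(\theta \mid x_1, \ldots, x_n)}{\partial \theta^2} \right|_{\theta = \bar{x}} = \frac{n}{\sigma_0^2}. \]

In this case, there is no need to estimate the Fisher information, but it is comforting
that both of our estimates give the exact value. ■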


We now state, without proof, some theorems about the large sample behavior of the
MLE under repeated sampling from the model. First, we have a result concerning the
consistency of the MLE as an estimator of the true value of $\theta$.


Theorem 6.5.2 Under regularity conditions (like those specified above) for the
model $\{f_\theta : \theta \in \Omega\}$, the MLE $\hat\theta$ exists a.s. and $\hat\theta \xrightarrow{a.s.} \theta$ as $n \to \infty$.


PROOF See Approximation Theorems of Mathematical Statistics, by R. J. Serfling
(John Wiley & Sons, New York, 1980), for the proof of this result. ■


We see that Theorem 6.5.2 serves as a kind of strong law for the MLE. It also turns
out that when the sample size is large, the sampling distribution of the MLE is
approximately normal.


Theorem 6.5.3 Under regularity conditions (like those specified above) for the
model $\{f_\theta : \theta \in \Omega\}$, then $(nI(\theta))^{1/2} (\hat\theta - \theta) \xrightarrow{D} N(0, 1)$ as $n \to \infty$.


PROOF See Approximation Theorems of Mathematical Statistics, by R. J. Serfling
(John Wiley & Sons, New York, 1980), for the proof of this result. ■




We see that Theorem 6.5.3 serves as a kind of central limit theorem for the MLE. To 
make this result fully useful to us for inference, we need the following corollary to this 
theorem. 


Corollary 6.5.2 When I is a continuous function of $\theta$, then

\[ (nI(\hat\theta))^{1/2} (\hat\theta - \theta) \xrightarrow{D} N(0, 1). \]


In Corollary 6.5.2, we have estimated the Fisher information $I(\theta)$ by the plug-in
Fisher information $I(\hat\theta)$. Often it is very difficult to evaluate the function I. In such a
case, we instead estimate $nI(\theta)$ by the observed Fisher information $\hat{I}(x_1, \ldots, x_n)$. A
result such as Corollary 6.5.2 again holds in this case.

From Corollary 6.5.2, we can devise large sample approximate inference methods
based on the MLE. For example, the approximate standard error of the MLE is

\[ (nI(\hat\theta))^{-1/2}. \]

An approximate $\gamma$-confidence interval is given by

\[ \hat\theta \pm (nI(\hat\theta))^{-1/2} z_{(1+\gamma)/2}. \]

Finally, if we want to assess the hypothesis $H_0 : \theta = \theta_0$, we can do this by computing
the approximate P-value

\[ 2 \left\{ 1 - \Phi\left( (nI(\theta_0))^{1/2} \, |\hat\theta - \theta_0| \right) \right\}. \]

Notice that we are using Theorem 6.5.3 for the P-value, rather than Corollary 6.5.2, as,
when $H_0$ is true, we know the asymptotic variance of the MLE is $(nI(\theta_0))^{-1}$. So we
do not have to estimate this quantity.

When evaluating I is difficult, we can replace $nI(\hat\theta)$ by $\hat{I}(x_1, \ldots, x_n)$ in the above
expressions for the confidence interval and P-value. We now see very clearly the
significance of the observed information. Of course, as we move from using $nI(\theta)$ to
$nI(\hat\theta)$ to $\hat{I}(x_1, \ldots, x_n)$, we expect that larger sample sizes n are needed to make the
normality approximation accurate.
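The confidence interval and P-value above depend on the data only through the MLE and a value for the sample Fisher information, so they can be packaged as two small functions. The Python sketch below is one way to do this; the function names are our own, and the numbers in the usage lines (a location normal model with n = 25, known standard deviation 2, and sample mean 10.4) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # the N(0, 1) distribution

def approx_confidence_interval(mle, information, gamma=0.95):
    """mle +/- information^(-1/2) * z_{(1+gamma)/2}, where `information` is the
    sample Fisher information n*I evaluated at the MLE (plug-in) or the
    observed Fisher information."""
    z = STD_NORMAL.inv_cdf((1 + gamma) / 2)
    half_width = z / sqrt(information)
    return mle - half_width, mle + half_width

def approx_p_value(mle, theta0, information_at_theta0):
    """2 * {1 - Phi(information^(1/2) * |mle - theta0|)}, where the sample
    Fisher information is evaluated at the hypothesized value theta0."""
    z = sqrt(information_at_theta0) * abs(mle - theta0)
    return 2 * (1 - STD_NORMAL.cdf(z))

# Hypothetical location normal data summary: n*I(theta) = n / sigma0^2.
n, sigma0, xbar = 25, 2.0, 10.4
information = n / sigma0**2

print("0.95 interval:", approx_confidence_interval(xbar, information))
print("P-value for H0: theta = 10:", approx_p_value(xbar, 10.0, information))
```

Keeping the information as an explicit argument makes the distinction in the text visible in the code: the interval uses the information at the MLE, while the P-value uses the information at the hypothesized value.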

We consider some examples. 


EXAMPLE 6.5.3 Location Normal Model
Using the Fisher information derived in Example 6.5.2, the approximate $\gamma$-confidence
interval based on the MLE is

\[ \hat\theta \pm (nI(\hat\theta))^{-1/2} z_{(1+\gamma)/2} = \bar{x} \pm (\sigma_0/\sqrt{n}) \, z_{(1+\gamma)/2}. \]

This is just the z-confidence interval derived in Example 6.3.6. Rather than being an
approximate $\gamma$-confidence interval, the coverage is exact in this case. Similarly, the
approximate P-value corresponds to the z-test and the P-value is exact. ■




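EXAMPLE 6.5.4 Bernoulli Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from a Bernoulli($\theta$) distribution, where
$\theta \in [0, 1]$ is unknown. The likelihood function is given by

\[ L(\theta \mid x_1, \ldots, x_n) = \theta^{n\bar{x}} (1 - \theta)^{n(1 - \bar{x})}, \]

and the MLE of $\theta$ is $\bar{x}$. The log-likelihood is

\[ l(\theta \mid x_1, \ldots, x_n) = n\bar{x} \ln \theta + n(1 - \bar{x}) \ln(1 - \theta), \]

the score function is given by

\[ S(\theta \mid x_1, \ldots, x_n) = \frac{n\bar{x}}{\theta} - \frac{n(1 - \bar{x})}{1 - \theta}, \]

and

\[ \frac{\partial}{\partial \theta} S(\theta \mid x_1, \ldots, x_n) = -\frac{n\bar{x}}{\theta^2} - \frac{n(1 - \bar{x})}{(1 - \theta)^2}. \]

Therefore, the Fisher information for the sample is

\[ nI(\theta) = E_\theta \left( -\frac{\partial}{\partial \theta} S(\theta \mid X_1, \ldots, X_n) \right) = E_\theta \left( \frac{n\bar{X}}{\theta^2} + \frac{n(1 - \bar{X})}{(1 - \theta)^2} \right) = \frac{n}{\theta(1 - \theta)}, \]

and the plug-in Fisher information is

\[ nI(\bar{x}) = \frac{n}{\bar{x}(1 - \bar{x})}. \]

Note that the plug-in Fisher information is the same as the observed Fisher information
in this case.

So an approximate $\gamma$-confidence interval is given by

\[ \hat\theta \pm (nI(\hat\theta))^{-1/2} z_{(1+\gamma)/2} = \bar{x} \pm z_{(1+\gamma)/2} \sqrt{\bar{x}(1 - \bar{x})/n}, \]

which is precisely the interval obtained in Example 6.3.7 using large sample
considerations based on the central limit theorem. Similarly, we obtain the same P-value as in
Example 6.3.11 when testing $H_0 : \theta = \theta_0$. ■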
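Because the interval in Example 6.5.4 is only approximate, its actual coverage probability depends on n and $\theta$. One way to get a feel for this is a small simulation, sketched below in Python. The true value $\theta = 0.3$, the two sample sizes, the number of replications, and the function names are all our own illustrative assumptions, and the estimated coverages will vary somewhat with the random seed.

```python
import random
from math import sqrt
from statistics import NormalDist

def bernoulli_interval(xbar, n, gamma=0.95):
    """Approximate gamma-confidence interval xbar +/- z * sqrt(xbar(1 - xbar)/n)."""
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    half_width = z * sqrt(xbar * (1 - xbar) / n)
    return xbar - half_width, xbar + half_width

def estimated_coverage(theta, n, reps, rng, gamma=0.95):
    """Fraction of simulated Bernoulli(theta) samples of size n whose
    approximate interval contains the true value theta."""
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.random() < theta for _ in range(n)) / n
        lower, upper = bernoulli_interval(xbar, n, gamma)
        hits += lower <= theta <= upper
    return hits / reps

rng = random.Random(1)  # fixed seed so the run is repeatable
coverage_small = estimated_coverage(theta=0.3, n=20, reps=2000, rng=rng)
coverage_large = estimated_coverage(theta=0.3, n=200, reps=2000, rng=rng)

print("estimated coverage, n = 20: ", coverage_small)
print("estimated coverage, n = 200:", coverage_large)
```

Note that when a simulated sample has $\bar{x} = 0$ or $\bar{x} = 1$ the interval collapses to a single point, which is one reason the approximation can be poor for small n or for $\theta$ near 0 or 1.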


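EXAMPLE 6.5.5 Poisson Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from a Poisson($\lambda$) distribution, where $\lambda > 0$ is
unknown. The likelihood function is given by

\[ L(\lambda \mid x_1, \ldots, x_n) = \lambda^{n\bar{x}} e^{-n\lambda}. \]

The log-likelihood is

\[ l(\lambda \mid x_1, \ldots, x_n) = n\bar{x} \ln \lambda - n\lambda, \]

the score function is given by

\[ S(\lambda \mid x_1, \ldots, x_n) = \frac{n\bar{x}}{\lambda} - n, \]

and

\[ \frac{\partial}{\partial \lambda} S(\lambda \mid x_1, \ldots, x_n) = -\frac{n\bar{x}}{\lambda^2}. \]

From this we deduce that the MLE of $\lambda$ is $\hat\lambda = \bar{x}$.
Therefore, the Fisher information for the sample is

\[ nI(\lambda) = E_\lambda \left( -\frac{\partial}{\partial \lambda} S(\lambda \mid X_1, \ldots, X_n) \right) = E_\lambda \left( \frac{n\bar{X}}{\lambda^2} \right) = \frac{n}{\lambda}, \]

and the plug-in Fisher information is

\[ nI(\bar{x}) = \frac{n}{\bar{x}}. \]

Note that the plug-in Fisher information is the same as the observed Fisher information
in this case.
So an approximate $\gamma$-confidence interval is given by

\[ \hat\lambda \pm (nI(\hat\lambda))^{-1/2} z_{(1+\gamma)/2} = \bar{x} \pm z_{(1+\gamma)/2} \sqrt{\bar{x}/n}. \]

Similarly, the approximate P-value for testing $H_0 : \lambda = \lambda_0$ is given by

\[ 2 \left\{ 1 - \Phi\left( (nI(\lambda_0))^{1/2} \, |\bar{x} - \lambda_0| \right) \right\} = 2 \left\{ 1 - \Phi\left( (n/\lambda_0)^{1/2} \, |\bar{x} - \lambda_0| \right) \right\}. \]

Note that we have used the Fisher information evaluated at $\lambda_0$ for this test. ■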
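When the second derivative of the log-likelihood is awkward to obtain analytically, the observed Fisher information can be approximated numerically. The Python sketch below uses a central second difference of the log-likelihood at the MLE and compares the result with the closed form $n/\bar{x}$ for the Poisson model. The counts are hypothetical, the step size h is an arbitrary choice, and the finite-difference approach is our own illustration rather than a method described in the text.

```python
from math import log

def poisson_log_likelihood(lam, counts):
    """Poisson log-likelihood, omitting the term -sum(ln(x_i!)) that does not
    involve lam and so does not affect derivatives in lam."""
    return sum(x * log(lam) - lam for x in counts)

def observed_information(log_lik, mle, h=1e-3):
    """Negative central second difference of the log-likelihood at the MLE,
    a numerical approximation to the observed Fisher information."""
    return -(log_lik(mle + h) - 2 * log_lik(mle) + log_lik(mle - h)) / h**2

counts = [3, 5, 2, 4, 6, 3, 4, 5]   # hypothetical Poisson counts
n = len(counts)
xbar = sum(counts) / n              # the MLE of lambda

numeric = observed_information(lambda lam: poisson_log_likelihood(lam, counts), xbar)
closed_form = n / xbar

print("numerical observed information:", numeric)
print("closed form n / xbar:          ", closed_form)
```

The step size trades truncation error against rounding error; for a smooth log-likelihood such as this one, a wide range of small h values gives close agreement with the closed form.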


Summary of Section 6.5 


• Under regularity conditions on the statistical model with parameter $\theta$, we can
define the Fisher information $I(\theta)$ for the model.

• Under regularity conditions on the statistical model, it can be proved that, when
$\theta$ is the true value of the parameter, the MLE is consistent for $\theta$ and the MLE
is approximately normally distributed with mean given by $\theta$ and with variance
given by $(nI(\theta))^{-1}$.

• The Fisher information $I(\theta)$ can be estimated by plugging in the MLE or by
using the observed Fisher information. These estimates lead to practically useful
inferences for $\theta$ in many problems.


EXERCISES 


6.5.1 If $(x_1, \ldots, x_n)$ is a sample from an $N(\mu_0, \sigma^2)$ distribution, where $\mu_0$ is known
and $\sigma^2 \in (0, \infty)$ is unknown, determine the Fisher information.

6.5.2 If $(x_1, \ldots, x_n)$ is a sample from a Gamma($\alpha_0$, $\theta$) distribution, where $\alpha_0$ is known
and $\theta \in (0, \infty)$ is unknown, determine the Fisher information.

6.5.3 If $(x_1, \ldots, x_n)$ is a sample from a Pareto($\alpha$) distribution (see Exercise 6.2.9),
where $\alpha > 0$ is unknown, determine the Fisher information.




6.5.4 Suppose the number of calls arriving at an answering service during a given
hour of the day is Poisson($\lambda$), where $\lambda \in (0, \infty)$ is unknown. The number of calls
actually received during this hour was recorded for 20 days and the following data
were obtained.


9 10 8 12 11 12 5 13 9 9 
7 5 16 13 9 5 13 8 9 10 


Construct an approximate 0.95-confidence interval for $\lambda$. Assess the hypothesis that
this is a sample from a Poisson(11) distribution. If you are going to decide that the
hypothesis is false when the P-value is less than 0.05, then compute an approximate
power for this procedure when $\lambda = 10$.

6.5.5 Suppose the lifelengths in hours of lightbulbs from a manufacturing process are
known to be distributed Gamma(2, $\theta$), where $\theta \in (0, \infty)$ is unknown. A random
sample of 27 bulbs was taken and their lifelengths measured with the following data
obtained.


336.87 2750.71 2199.44 292.99 1835.55 1385.36 2690.52 
710.64 2162.01 1856.47 2225.68 3524.23 2618.51 361.68 


979.54 2159.18 1908.94 1397.96 914.41 1548.48 1801.84 
1016.16 1666.71 1196.42 1225.68 2422.53 753.24 


Determine an approximate 0.90-confidence interval for $\theta$.

6.5.6 Repeat the analysis of Exercise 6.5.5, but this time assume that the lifelengths
are distributed Gamma(1, $\theta$). Comment on the differences in the two analyses.

6.5.7 Suppose that incomes (measured in thousands of dollars) above $20K can be
assumed to be Pareto($\alpha$), where $\alpha > 0$ is unknown, for a particular population. A
sample of 20 is taken from the population and the following data obtained.


21.265 20.857 21.090 20.047 20.019 32.509 21.622 20.693 
20.109 23.182 21.199 20.035 20.084 20.038 22.054 20.190 


20.488 20.456 20.066 20.302 


Construct an approximate 0.95-confidence interval for $\alpha$. Assess the hypothesis that
the mean income in this population is $25K.

6.5.8 Suppose that $(x_1, \ldots, x_n)$ is a sample from an Exponential($\theta$) distribution.
Construct an approximate left-sided $\gamma$-confidence interval for $\theta$. (See Problem 6.3.25.)

6.5.9 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Geometric($\theta$) distribution.
Construct an approximate left-sided $\gamma$-confidence interval for $\theta$. (See Problem 6.3.25.)

6.5.10 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Negative-Binomial(r, $\theta$)
distribution. Construct an approximate left-sided $\gamma$-confidence interval for $\theta$. (See Problem
6.3.25.)


PROBLEMS 


6.5.11 In Exercise 6.5.1, verify that (6.5.2), (6.5.3), (6.5.4), and (6.5.5) are satisfied. 
6.5.12 In Exercise 6.5.2, verify that (6.5.2), (6.5.3), (6.5.4), and (6.5.5) are satisfied. 




6.5.13 In Exercise 6.5.3, verify that (6.5.2), (6.5.3), (6.5.4), and (6.5.5) are satisfied. 
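6.5.14 Suppose that sampling from the model $\{f_\theta : \theta \in \Omega\}$ satisfies (6.5.2), (6.5.3),
(6.5.4), and (6.5.5). Prove that $\frac{1}{n} \hat{I} \xrightarrow{a.s.} I(\theta)$ as $n \to \infty$.

6.5.15 (MV) When $\theta = (\theta_1, \theta_2)$, then, under appropriate regularity conditions for the
model $\{f_\theta : \theta \in \Omega\}$, the Fisher information matrix is defined by

\[
I(\theta) =
\begin{pmatrix}
E_\theta \left( -\dfrac{\partial^2}{\partial \theta_1^2} l(\theta \mid X) \right) & E_\theta \left( -\dfrac{\partial^2}{\partial \theta_1 \partial \theta_2} l(\theta \mid X) \right) \\[2ex]
E_\theta \left( -\dfrac{\partial^2}{\partial \theta_1 \partial \theta_2} l(\theta \mid X) \right) & E_\theta \left( -\dfrac{\partial^2}{\partial \theta_2^2} l(\theta \mid X) \right)
\end{pmatrix}.
\]

If $(X_1, X_2, X_3) \sim$ Multinomial$(1, \theta_1, \theta_2, \theta_3)$ (Example 6.1.5), then determine the
Fisher information for this model. Recall that $\theta_3 = 1 - \theta_1 - \theta_2$ and so is determined
from $(\theta_1, \theta_2)$.

6.5.16 (MV) Generalize Problem 6.5.15 to the case where

\[ (X_1, \ldots, X_k) \sim \text{Multinomial}(1, \theta_1, \ldots, \theta_k). \]

6.5.17 (MV) Using the definition of the Fisher information matrix in Exercise 6.5.15,
determine the Fisher information for the Bivariate Normal$(\mu_1, \mu_2, 1, 1, 0)$ model, where
$\mu_1, \mu_2 \in R^1$ are unknown.

6.5.18 (MV) Extending the definition in Exercise 6.5.15 to the three-dimensional case,
determine the Fisher information for the Bivariate Normal$(\mu_1, \mu_2, \sigma^2, \sigma^2, 0)$ model
where $\mu_1, \mu_2 \in R^1$, and $\sigma^2 > 0$ are unknown.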


CHALLENGES 


6.5.19 Suppose that model $\{f_\theta : \theta \in \Omega\}$ satisfies (6.5.2), (6.5.3), (6.5.4), (6.5.5), and
has Fisher information $I(\theta)$. If $\Psi : \Omega \to R^1$ is 1-1, and $\Psi$ and $\Psi^{-1}$ are continuously
differentiable, then, putting $\Upsilon = \{\Psi(\theta) : \theta \in \Omega\}$, prove that the model given by
$\{g_\psi : \psi \in \Upsilon\}$ satisfies the regularity conditions and that its Fisher information at $\psi$ is given
by $I(\Psi^{-1}(\psi)) \left((\Psi^{-1})'(\psi)\right)^2$.


DISCUSSION TOPICS 


6.5.20 The method of moments inference methods discussed in Section 6.4.1 are es- 
sentially large sample methods based on the central limit theorem. The large sample 
methods in Section 6.5 are based on the form of the likelihood function. Which meth- 
ods do you think are more likely to be correct when we know very little about the form 
of the distribution from which we are sampling? In what sense will your choice be
“more correct”?


Chapter 7 
Bayesian Inference 


CHAPTER OUTLINE 


Section 1 The Prior and Posterior Distributions
Section 2 Inferences Based on the Posterior
Section 3 Bayesian Computations
Section 4 Choosing Priors
Section 5 Further Proofs (Advanced)


In Chapter 5, we introduced the basic concepts of inference. At the heart of the theory
of inference is the concept of the statistical model $\{f_\theta : \theta \in \Omega\}$ that describes the
statistician’s uncertainty about how the observed data were produced. Chapter 6 dealt
with the analysis of this uncertainty based on the model and the data alone. In some 
cases, this seemed quite successful, but we note that we only dealt with some of the 
simpler contexts there. 

If we accept the principle that, to be amenable to analysis, all uncertainties need to 
be described by probabilities, then the prescription of a model alone is incomplete, as 
this does not tell us how to make probability statements about the unknown true value
of $\theta$. In this chapter, we complete the description so that all uncertainties are described
by probabilities. This leads to a probability distribution for $\theta$, and, in essence, we are in
the situation of Section 5.2, with the parameter now playing the role of the unobserved 
response. This is the Bayesian approach to inference. 

Many statisticians prefer to develop statistical theory without the additional ingre- 
dients necessary for a full probability description of the unknowns. In part, this is 
motivated by the desire to avoid the prescription of the additional model ingredients 
necessary for the Bayesian formulation. Of course, we would prefer to have our sta- 
tistical analysis proceed based on the fewest and weakest model assumptions possible. 
For example, in Section 6.4, we introduced distribution-free methods. A price is paid 
for this weakening, however, and this typically manifests itself in ambiguities about 
how inference should proceed. The Bayesian formulation in essence removes the am- 
biguity, but at the price of a more involved model. 

The Bayesian approach to inference is sometimes presented as antagonistic to meth- 
ods that are based on repeated sampling properties (often referred to as frequentist 




methods), as discussed, for example, in Chapter 6. The approach taken in this text, 
however, is that the Bayesian model arises naturally from the statistician assuming 
more ingredients for the model. It is up to the statistician to decide what ingredients 
can be justified and then use appropriate methods. We must be wary of all model 
assumptions, because using inappropriate ones may invalidate our inferences. Model 
checking will be taken up in Chapter 9. 


7.1 The Prior and Posterior Distributions


The Bayesian model for inference contains the statistical model $\{f_\theta : \theta \in \Omega\}$ for the
data $s \in S$ and adds to this the prior probability measure $\Pi$ for $\theta$. The prior describes
the statistician’s beliefs about the true value of the parameter $\theta$ a priori, i.e., before
observing the data. For example, if $\Omega = [0, 1]$ and $\theta$ equals the probability of getting
a head on the toss of a coin, then the prior density $\pi$ plotted in Figure 7.1.1 indicates
that the statistician has some belief that the true value of $\theta$ is around 0.5. But this
information is not very precise.


Figure 7.1.1: A fairly diffuse prior on [0,1].


On the other hand, the prior density $\pi$ plotted in Figure 7.1.2 indicates that the
statistician has very precise information about the true value of $\theta$. If, instead, the statistician
knows nothing about the true value of $\theta$, then using the uniform distribution on $[0, 1]$
might be appropriate.


Figure 7.1.2: A fairly precise prior on [0,1].


It is important to remember that the probabilities prescribed by the prior repre- 
sent beliefs. They do not in general correspond to long-run frequencies, although they 
could in certain circumstances. A natural question to ask is: Where do these beliefs 
come from in an application? An easy answer is to say that they come from previous 
experience with the random system under investigation or perhaps with related sys- 
tems. To be honest, however, this is rarely the case, and one has to admit that the 
prior, as well as the statistical model, is often a somewhat arbitrary construction used 
to drive the statistician’s investigations. This raises the issue as to whether or not the 
inferences derived have any relevance to the practical context, if the model ingredients 
suffer from this arbitrariness. This is where the concept of model checking comes into 
play, a topic we will discuss in Chapter 9. At this point, we will assume that all the 
ingredients make sense, but remember that in an application, these must be checked if 
the inferences taken are to be practically meaningful. 

We note that the ingredients of the Bayesian formulation for inference prescribe a
marginal distribution for $\theta$, namely, the prior $\Pi$, and a set of conditional distributions
for the data $s$ given $\theta$, namely, $\{f_\theta : \theta \in \Omega\}$. By the law of total probability (Theorems
2.3.1 and 2.8.1), these ingredients specify a joint distribution for $(s, \theta)$, namely,

$$\pi(\theta) f_\theta(s),$$

where $\pi$ denotes the probability or density function associated with $\Pi$. When the prior
distribution is absolutely continuous, the marginal distribution for $s$ is given by

$$m(s) = \int_\Omega \pi(\theta) f_\theta(s) \, d\theta$$

and is referred to as the prior predictive distribution of the data. When the prior
distribution of $\theta$ is discrete, we replace (as usual) the integral by a sum.




If we did not observe any data, then the prior predictive distribution is the relevant
distribution for making probability statements about the unknown value of $s$. Similarly,
the prior $\pi$ is the relevant distribution to use in making probability statements about $\theta$,
before we observe $s$. Inference about these unobserved quantities then proceeds as
described in Section 5.2.

Recall now the principle of conditional probability; namely, $P(A)$ is replaced by
$P(A \mid C)$ after we are told that $C$ is true. Therefore, after observing the data, the
relevant distribution to use in making probability statements about $\theta$ is the conditional
distribution of $\theta$ given $s$. We denote this conditional probability measure by $\Pi(\cdot \mid s)$
and refer to it as the posterior distribution of $\theta$. Note that the density (or probability
function) of the posterior is obtained immediately by taking the joint density $\pi(\theta) f_\theta(s)$
of $(s, \theta)$ and dividing it by the marginal $m(s)$ of $s$.


Definition 7.1.1 The posterior distribution of $\theta$ is the conditional distribution of
$\theta$, given $s$. The posterior density, or posterior probability function (whichever is
relevant), is given by

$$\pi(\theta \mid s) = \frac{\pi(\theta) f_\theta(s)}{m(s)}. \tag{7.1.1}$$

Sometimes this use of conditional probability is referred to as an application of
Bayes’ theorem (Theorem 1.5.2). This is because we can think of a value of $\theta$ being
selected first according to $\pi$, and then $s$ is generated from $f_\theta$. We then want to make
probability statements about the first stage, having observed the outcome of the second
stage. It is important to remember, however, that choosing to use the posterior
distribution for probability statements about $\theta$ is an axiom, or principle, not a theorem.

We note that in (7.1.1) the prior predictive of the data $s$ plays the role of the inverse
normalizing constant for the posterior density. By this we mean that the posterior
density of $\theta$ is proportional to $\pi(\theta) f_\theta(s)$, as a function of $\theta$; to convert this into a
proper density function, we need only divide by $m(s)$. In many examples, we do not
need to compute the inverse normalizing constant. This is because we recognize the
functional form, as a function of $\theta$, of the posterior from the expression $\pi(\theta) f_\theta(s)$
and so immediately deduce the posterior probability distribution of $\theta$. Also, there are
Monte Carlo methods, such as those discussed in Chapter 4, that allow us to sample
from $\pi(\theta \mid s)$ without knowing $m(s)$ (also see Section 7.3).

We consider some applications of Bayesian inference.


EXAMPLE 7.1.1 Bernoulli Model

Suppose that we observe a sample $(x_1, \ldots, x_n)$ from the Bernoulli$(\theta)$ distribution with
$\theta \in [0, 1]$ unknown. For the prior, we take $\pi$ to be equal to a Beta$(\alpha, \beta)$ density (see
Problem 2.4.16). Then the posterior of $\theta$ is proportional to the likelihood

$$\prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{n\bar{x}} (1 - \theta)^{n(1 - \bar{x})}$$

times the prior

$$B^{-1}(\alpha, \beta)\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}.$$

This product is proportional to

$$\theta^{n\bar{x} + \alpha - 1} (1 - \theta)^{n(1 - \bar{x}) + \beta - 1}.$$

We recognize this as the unnormalized density of a Beta$(n\bar{x} + \alpha, n(1 - \bar{x}) + \beta)$
distribution. So in this example, we did not need to compute $m(x_1, \ldots, x_n)$ to obtain the
posterior.

As a specific case, suppose that we observe $n\bar{x} = 10$ in a sample of $n = 40$ and
$\alpha = \beta = 1$, i.e., we have a uniform prior on $\theta$. Then the posterior of $\theta$ is given by the
Beta(11, 31) distribution. We plot the posterior density in Figure 7.1.3 as well as the
prior.


Figure 7.1.3: Prior (dashed line) and posterior densities (solid line) in Example 7.1.1.


The spread of the posterior distribution gives us some idea of the precision of any
probability statements we make about $\theta$. Note how much information the data have
added, as reflected in the graphs of the prior and posterior densities. ■
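
The conjugate update in Example 7.1.1 can be checked numerically. The following Python sketch is only an illustration: the function names (`beta_posterior`, `beta_pdf`, `unnormalized_posterior`) and the grid size are our own choices, not part of the text. It computes the posterior parameters for $n\bar{x} = 10$, $n = 40$, $\alpha = \beta = 1$, and then confirms that dividing prior times likelihood by a numerical approximation of $m(x_1, \ldots, x_n)$ agrees with the Beta(11, 31) density at one point.

```python
import math


def beta_posterior(successes, n, alpha, beta):
    """Posterior Beta parameters for a Bernoulli sample under a Beta(alpha, beta) prior."""
    return successes + alpha, (n - successes) + beta


def beta_pdf(theta, a, b):
    """Beta(a, b) density on the open interval (0, 1), computed via logarithms."""
    if theta <= 0.0 or theta >= 1.0:
        return 0.0  # endpoints are ignored in this sketch
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))


def unnormalized_posterior(theta, successes=10, n=40):
    """Likelihood times a uniform prior: theta^10 (1 - theta)^30 for the specific case."""
    return theta ** successes * (1 - theta) ** (n - successes)


a_post, b_post = beta_posterior(successes=10, n=40, alpha=1, beta=1)

# Approximate the prior predictive m(x_1, ..., x_n) with a midpoint rule on [0, 1].
grid_size = 2000
midpoints = [(i + 0.5) / grid_size for i in range(grid_size)]
m_approx = sum(unnormalized_posterior(t) for t in midpoints) / grid_size

theta0 = 0.25
print(a_post, b_post)  # 11 31
print(unnormalized_posterior(theta0) / m_approx, beta_pdf(theta0, a_post, b_post))
```

The two printed density values should agree to several decimal places, which illustrates that recognizing the functional form and normalizing by $m$ give the same posterior.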


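EXAMPLE 7.1.2 Location Normal Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in R^1$ is
unknown and $\sigma_0^2$ is known. The likelihood function is then given by

$$L(\mu \mid x_1, \ldots, x_n) = \exp\left(-\frac{n}{2\sigma_0^2} (\bar{x} - \mu)^2\right).$$

Suppose we take the prior distribution of $\mu$ to be an $N(\mu_0, \tau_0^2)$ for some specified
choice of $\mu_0$ and $\tau_0^2$. The posterior density of $\mu$ is then proportional to

$$
\begin{aligned}
&\exp\left(-\frac{1}{2\tau_0^2}(\mu - \mu_0)^2\right) \exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu)^2\right) \\
&\quad = \exp\left(-\frac{1}{2}\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)\left(\mu - \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + \frac{n}{\sigma_0^2}\bar{x}\right)\right)^2\right) \\
&\qquad \times \exp\left(\frac{1}{2}\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + \frac{n}{\sigma_0^2}\bar{x}\right)^2\right) \exp\left(-\frac{1}{2}\left(\frac{\mu_0^2}{\tau_0^2} + \frac{n\bar{x}^2}{\sigma_0^2}\right)\right).
\end{aligned}
\tag{7.1.2}
$$

We immediately recognize this, as a function of $\mu$, as being proportional to the density
of an

$$N\left(\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + \frac{n}{\sigma_0^2}\bar{x}\right),\ \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\right)$$

distribution.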


Notice that the posterior mean is a weighted average of the prior mean $\mu_0$ and the
sample mean $\bar{x}$, with weights

$$\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1} \frac{1}{\tau_0^2} \quad \text{and} \quad \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1} \frac{n}{\sigma_0^2},$$

respectively. This implies that the posterior mean lies between the prior mean and the
sample mean.

Furthermore, the posterior variance is smaller than the variance of the sample mean.
So if the information expressed by the prior is accurate, inferences about $\mu$ based on
the posterior will be more accurate than those based on the sample mean alone. Note
that the more diffuse the prior is (namely, the larger $\tau_0^2$ is), the less influence the
prior has. For example, when $n = 20$ and $\sigma_0^2 = 1$, $\tau_0^2 = 1$, then the ratio of the
posterior variance to the sample mean variance is $20/21 \approx 0.95$. So there has been a
5% improvement due to the use of prior information.


For example, suppose that $\sigma_0^2 = 1$, $\mu_0 = 0$, $\tau_0^2 = 2$, and that for $n = 10$, we
observe $\bar{x} = 1.2$. Then the prior is an $N(0, 2)$ distribution, while the posterior is an

$$N\left(\left(\frac{1}{2} + \frac{10}{1}\right)^{-1}\left(\frac{0}{2} + \frac{10}{1}(1.2)\right),\ \left(\frac{1}{2} + \frac{10}{1}\right)^{-1}\right) = N(1.1429,\ 9.5238 \times 10^{-2})$$

distribution. These densities are plotted in Figure 7.1.4. Notice that the posterior is
quite concentrated compared to the prior, so we have learned a lot from the data. ■


Figure 7.1.4: Plot of the $N(0, 2)$ prior (dashed line) and the $N(1.1429, 9.5238 \times 10^{-2})$
posterior (solid line) in Example 7.1.2.
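
The arithmetic of Example 7.1.2 is easy to reproduce. The short Python sketch below is an illustration under our own naming (`normal_posterior` is a hypothetical helper, not from the text): it evaluates the posterior mean and variance from the precision-weighted formulas and also recomputes the variance ratio $20/21$ quoted in the example.

```python
def normal_posterior(xbar, n, sigma0_sq, mu0, tau0_sq):
    """Posterior mean and variance of mu for the location normal model.

    The posterior precision is the sum of the prior precision 1/tau0_sq and
    the data precision n/sigma0_sq; the mean is the precision-weighted average.
    """
    precision = 1.0 / tau0_sq + n / sigma0_sq
    mean = (mu0 / tau0_sq + n * xbar / sigma0_sq) / precision
    return mean, 1.0 / precision


# Numerical case from the example: sigma0^2 = 1, mu0 = 0, tau0^2 = 2, n = 10, xbar = 1.2.
post_mean, post_var = normal_posterior(xbar=1.2, n=10, sigma0_sq=1.0, mu0=0.0, tau0_sq=2.0)
print(round(post_mean, 4), round(post_var, 6))  # 1.1429 0.095238

# Ratio of posterior variance to sample-mean variance when n = 20, sigma0^2 = tau0^2 = 1.
_, var_20 = normal_posterior(xbar=0.0, n=20, sigma0_sq=1.0, mu0=0.0, tau0_sq=1.0)
variance_ratio = var_20 / (1.0 / 20)
print(round(variance_ratio, 4))  # 0.9524
```

The ratio does not depend on $\bar{x}$ or $\mu_0$, which is why arbitrary values are passed for them in the second call.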


EXAMPLE 7.1.3 Multinomial Model
Suppose we have a categorical response $s$ that takes $k$ possible values, say,
$s \in S = \{1, \ldots, k\}$. For example, suppose we have a bowl containing chips labelled one of
$1, \ldots, k$. A proportion $\theta_i$ of the chips are labelled $i$, and we randomly draw a chip,
observing its label.

When the $\theta_i$ are unknown, the statistical model is given by

$$\{p_{(\theta_1, \ldots, \theta_k)} : (\theta_1, \ldots, \theta_k) \in \Omega\},$$

where $p_{(\theta_1, \ldots, \theta_k)}(i) = P(s = i) = \theta_i$ and

$$\Omega = \{(\theta_1, \ldots, \theta_k) : 0 \le \theta_i \le 1,\ i = 1, \ldots, k \text{ and } \theta_1 + \cdots + \theta_k = 1\}.$$

Note that the parameter space is really only $(k - 1)$-dimensional because, for example,
$\theta_k = 1 - \theta_1 - \cdots - \theta_{k-1}$, namely, once we have determined $k - 1$ of the $\theta_i$, the
remaining value is specified.

Now suppose we observe a sample $(s_1, \ldots, s_n)$ from this model. Let the frequency
(count) of the $i$th category in the sample be denoted by $x_i$. Then, from Example 2.8.5,
we see that the likelihood is given by

$$L(\theta_1, \ldots, \theta_k \mid s_1, \ldots, s_n) = \theta_1^{x_1} \theta_2^{x_2} \cdots \theta_k^{x_k}.$$


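For the prior we assume that $(\theta_1, \ldots, \theta_{k-1}) \sim$ Dirichlet$(\alpha_1, \alpha_2, \ldots, \alpha_k)$ with
density (see Problem 2.7.13) given by

$$\frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)}\, \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \cdots \theta_k^{\alpha_k - 1} \tag{7.1.3}$$

for $(\theta_1, \ldots, \theta_k) \in \Omega$ (recall that $\theta_k = 1 - \theta_1 - \cdots - \theta_{k-1}$). The $\alpha_i$ are nonnegative
constants chosen by the statistician to reflect her beliefs about the unknown value
of $(\theta_1, \ldots, \theta_k)$. The choice $\alpha_1 = \alpha_2 = \cdots = \alpha_k = 1$ corresponds to a uniform
distribution, as then (7.1.3) is constant on $\Omega$.

The posterior density of $(\theta_1, \ldots, \theta_{k-1})$ is then proportional to

$$\theta_1^{x_1 + \alpha_1 - 1} \theta_2^{x_2 + \alpha_2 - 1} \cdots \theta_k^{x_k + \alpha_k - 1}$$

for $(\theta_1, \ldots, \theta_k) \in \Omega$. From (7.1.3), we immediately deduce that the posterior
distribution of $(\theta_1, \ldots, \theta_{k-1})$ is Dirichlet$(x_1 + \alpha_1, x_2 + \alpha_2, \ldots, x_k + \alpha_k)$. ■
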
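
The Dirichlet update of Example 7.1.3 amounts to adding category counts to the prior parameters. The Python sketch below illustrates this; the sample of labels is made up for the illustration and `dirichlet_posterior` is a hypothetical helper name. The posterior means use the standard Dirichlet fact that the mean of $\theta_i$ is its parameter divided by the sum of all parameters.

```python
from collections import Counter


def dirichlet_posterior(sample, prior_alphas):
    """Add the count x_i of each label i = 1, ..., k to the prior parameter alpha_i."""
    k = len(prior_alphas)
    counts = Counter(sample)
    return [counts.get(i, 0) + prior_alphas[i - 1] for i in range(1, k + 1)]


# Hypothetical draws from the bowl with k = 3 labels (illustrative data only).
sample = [1, 3, 2, 1, 1, 3, 2, 1]
posterior = dirichlet_posterior(sample, prior_alphas=[1, 1, 1])  # uniform prior
print(posterior)  # [5, 3, 3]

# Posterior mean of each theta_i under the Dirichlet posterior.
total = sum(posterior)
posterior_means = [a / total for a in posterior]
print(posterior_means)
```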
EXAMPLE 7.1.4 Location-Scale Normal Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where $\mu \in R^1$
and $\sigma > 0$ are unknown. The likelihood function is then given by

$$L(\mu, \sigma^2 \mid x_1, \ldots, x_n) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2\right) \exp\left(-\frac{n - 1}{2\sigma^2} s^2\right).$$

Suppose we put the following prior on $(\mu, \sigma^2)$. First, we specify that

$$\mu \mid \sigma^2 \sim N(\mu_0, \tau_0^2 \sigma^2),$$

i.e., the conditional prior distribution of $\mu$ given $\sigma^2$ is normal with mean $\mu_0$ and
variance $\tau_0^2 \sigma^2$. Then we specify the marginal prior distribution of $\sigma^2$ as

$$\frac{1}{\sigma^2} \sim \text{Gamma}(\alpha_0, \beta_0). \tag{7.1.4}$$

Sometimes (7.1.4) is referred to by saying that $\sigma^2$ is distributed inverse Gamma. The
values $\mu_0$, $\tau_0^2$, $\alpha_0$, and $\beta_0$ are selected by the statistician to reflect his prior beliefs.

From this, we can deduce (see Section 7.5 for the full derivation) that the posterior
distribution of $(\mu, \sigma^2)$ is given by

$$\mu \mid \sigma^2, x_1, \ldots, x_n \sim N\left(\mu_x, \left(n + \frac{1}{\tau_0^2}\right)^{-1} \sigma^2\right) \tag{7.1.5}$$

and

$$\frac{1}{\sigma^2} \,\Big|\, x_1, \ldots, x_n \sim \text{Gamma}(\alpha_0 + n/2, \beta_x), \tag{7.1.6}$$

where

$$\mu_x = \left(n + \frac{1}{\tau_0^2}\right)^{-1} \left(\frac{\mu_0}{\tau_0^2} + n\bar{x}\right) \tag{7.1.7}$$

and

$$\beta_x = \beta_0 + \frac{n - 1}{2} s^2 + \frac{1}{2} \frac{n(\bar{x} - \mu_0)^2}{1 + n\tau_0^2}. \tag{7.1.8}$$

To generate a value $(\mu, \sigma^2)$ from the posterior, we can make use of the method of
composition (see Problem 2.10.13) by first generating $\sigma^2$ using (7.1.6) and then using
(7.1.5) to generate $\mu$. We will discuss this further in Section 7.3.

Notice that as $\tau_0 \to \infty$, i.e., as the prior on $\mu$ becomes increasingly diffuse,
the conditional posterior distribution of $\mu$ given $\sigma^2$ converges in distribution to an
$N(\bar{x}, \sigma^2/n)$ distribution because

$$\mu_x \to \bar{x} \tag{7.1.9}$$

and

$$\left(n + \frac{1}{\tau_0^2}\right)^{-1} \to \frac{1}{n}. \tag{7.1.10}$$

Furthermore, as $\tau_0 \to \infty$ and $\beta_0 \to 0$, the marginal posterior of $1/\sigma^2$ converges in
distribution to a Gamma$(\alpha_0 + n/2, (n - 1)s^2/2)$ distribution because

$$\beta_x \to (n - 1)s^2/2. \tag{7.1.11}$$

Actually, it does not really seem to make sense to let $\tau_0 \to \infty$ and $\beta_0 \to 0$ in
the prior distribution of $(\mu, \sigma^2)$, as the prior does not converge to a proper probability
distribution. The idea here, however, is that we think of taking $\tau_0$ large and $\beta_0$ small,
so that the posterior inferences are approximately those obtained from the limiting
posterior. There is still a need to choose $\alpha_0$, however, even in the diffuse case, as the
limiting inferences are dependent on this quantity. ■
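
The method of composition described in Example 7.1.4 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the helper names are ours, the numerical inputs are borrowed from Computer Problem 7.1.21 purely as sample values, and Gamma$(\alpha, \beta)$ is taken to have rate $\beta$ (consistent with Gamma$(\alpha, 1/2) = \chi^2(2\alpha)$ used later in the text), so the scale passed to Python's `gammavariate` is $1/\beta_x$.

```python
import math
import random


def posterior_params(xbar, s_sq, n, mu0, tau0_sq, alpha0, beta0):
    """Return mu_x (7.1.7), beta_x (7.1.8), and the Gamma shape alpha0 + n/2."""
    mu_x = (mu0 / tau0_sq + n * xbar) / (n + 1.0 / tau0_sq)
    beta_x = (beta0 + (n - 1) * s_sq / 2.0
              + 0.5 * n * (xbar - mu0) ** 2 / (1.0 + n * tau0_sq))
    return mu_x, beta_x, alpha0 + n / 2.0


def sample_posterior(xbar, s_sq, n, mu0, tau0_sq, alpha0, beta0, size, rng):
    """Draw (mu, sigma^2) pairs: first 1/sigma^2 from (7.1.6), then mu from (7.1.5)."""
    mu_x, beta_x, shape = posterior_params(xbar, s_sq, n, mu0, tau0_sq, alpha0, beta0)
    draws = []
    for _ in range(size):
        # gammavariate takes (shape, scale); the rate beta_x becomes scale 1/beta_x.
        precision = rng.gammavariate(shape, 1.0 / beta_x)
        sigma_sq = 1.0 / precision
        mu = rng.gauss(mu_x, math.sqrt(sigma_sq / (n + 1.0 / tau0_sq)))
        draws.append((mu, sigma_sq))
    return draws


settings = dict(xbar=8.2, s_sq=2.1, n=20, mu0=0.0, tau0_sq=1.0, alpha0=2.0, beta0=1.0)
mu_x, beta_x, shape = posterior_params(**settings)
draws = sample_posterior(size=5000, rng=random.Random(0), **settings)
mean_mu = sum(mu for mu, _ in draws) / len(draws)
print(mu_x, beta_x, shape)
print(mean_mu)  # should be close to mu_x
```

Fixing the seed through `random.Random(0)` makes the run reproducible; any other seed or sample size could be used.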


Summary of Section 7.1 


• Bayesian inference adds the prior probability distribution to the sampling model
for the data as an additional ingredient to be used in determining inferences about
the unknown value of the parameter.

• Having observed the data, the principle of conditional probability leads to the
posterior distribution of the parameter as the basis for inference.

• Inference about marginal parameters is handled by marginalizing the full posterior.


EXERCISES 


7.1.1 Suppose that $S = \{1, 2\}$, $\Omega = \{1, 2, 3\}$, and the class of probability distributions
for the response $s$ is given by the following table.




If we use the prior $\pi(\theta)$ given by the table

                  $\theta = 1$   $\theta = 2$   $\theta = 3$
    $\pi(\theta)$      1/5            2/5            2/5

then determine the posterior distribution of $\theta$ for each possible sample of size 2.


7.1.2 In Example 7.1.1, determine the posterior mean and variance of $\theta$.


7.1.3 In Example 7.1.2, what is the posterior probability that $\mu$ is positive, given that
$n = 10$, $\bar{x} = 1$ when $\sigma_0^2 = 1$, $\mu_0 = 0$, and $\tau_0^2 = 10$? Compare this with the prior
probability of this event.

7.1.4 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Poisson$(\lambda)$ distribution with $\lambda > 0$
unknown. If we use the prior distribution for $\lambda$ given by the Gamma$(\alpha, \beta)$ distribution,
then determine the posterior distribution of $\lambda$.

7.1.5 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Uniform$[0, \theta]$ distribution with
$\theta > 0$ unknown. If the prior distribution of $\theta$ is Gamma$(\alpha, \beta)$, then obtain the form of
the posterior density of $\theta$.

7.1.6 Find the posterior mean and variance of $\theta_1$ in Example 7.1.3 when $k = 3$. (Hint:
See Problems 3.2.16 and 3.3.20.)


7.1.7 Suppose we have a sample

6.56 6.39 3.30 3.03 5.31 5.62 5.10 2.45 8.24 3.71
4.14 2.80 7.43 6.82 4.75 4.09 7.95 5.84 8.44 9.36

from an $N(\mu, \sigma^2)$ distribution and we determine that a prior specified by
$\mu \mid \sigma^2 \sim N(3, 4\sigma^2)$, $\sigma^{-2} \sim$ Gamma$(1, 1)$ is appropriate. Determine the posterior distribution
of $(\mu, 1/\sigma^2)$.

7.1.8 Suppose that the prior probability of $\theta$ being in a set $A \subset \Omega$ is 0.25 and the
posterior probability of $\theta$ being in $A$ is 0.80.

(a) Explain what effect the data have had on your beliefs concerning the true value of
$\theta$ being in $A$.

(b) Explain why a posterior probability is more relevant to report than is a prior
probability.

7.1.9 Suppose you toss a coin and put a Uniform$[0.4, 0.6]$ prior on $\theta$, the probability
of getting a head on a single toss.

(a) If you toss the coin $n$ times and obtain $n$ heads, then determine the posterior density
of $\theta$.

(b) Suppose the true value of $\theta$ is, in fact, 0.99. Will the posterior distribution of $\theta$ ever
put any probability mass around $\theta = 0.99$ for any sample of $n$?

(c) What do you conclude from part (b) about how you should choose a prior?


7.1.10 Suppose that for statistical model $\{f_\theta : \theta \in R^1\}$, we assign the prior density
$\pi$. Now suppose that we reparameterize the model via the function $\psi = \Psi(\theta)$, where
$\Psi : R^1 \to R^1$ is differentiable and strictly increasing.

(a) Determine the prior density of $\psi$.
(b) Show that $m(x)$ is the same whether we parameterize the model by $\theta$ or by $\psi$.




7.1.11 Suppose that for statistical model $\{f_\theta : \theta \in \Omega\}$, where $\Omega = \{-2, -1, 0, 1, 2, 3\}$,
we assign the prior probability function $\pi$, which is uniform on $\Omega$. Now suppose we
are interested primarily in making inferences about $|\theta|$.

(a) Determine the prior probability distribution of $|\theta|$. Is this distribution uniform?

(b) A uniform prior distribution is sometimes used to express complete ignorance about
the value of a parameter. Does complete ignorance about the value of a parameter imply
complete ignorance about a function of a parameter? Explain.

7.1.12 Suppose that for statistical model $\{f_\theta : \theta \in [0, 1]\}$, we assign the prior density
$\pi$, which is uniform on $\Omega = [0, 1]$. Now suppose we are interested primarily in making
inferences about $\theta^2$.

(a) Determine the prior density of $\theta^2$. Is this distribution uniform?

(b) A uniform prior distribution is sometimes used to express complete ignorance about
the value of a parameter. Does complete ignorance about the value of a parameter imply
complete ignorance about a function of a parameter? Explain.


COMPUTER EXERCISES 


7.1.13 In Example 7.1.2, when $\mu_0 = 2$, $\tau_0^2 = 1$, $\sigma_0^2 = 1$, $n = 20$, and $\bar{x} = 8.2$,
generate a sample of $10^4$ (or as large as possible) from the posterior distribution of $\mu$
and estimate the posterior probability that the coefficient of variation is greater than
0.125, i.e., the posterior probability that $\sigma_0/\mu > 0.125$. Estimate the error in your
approximation.

7.1.14 In Example 7.1.2, when $\mu_0 = 2$, $\tau_0^2 = 1$, $\sigma_0^2 = 1$, $n = 20$, and $\bar{x} = 8.2$,
generate a sample of $10^4$ (or as large as possible) from the posterior distribution of $\mu$
and estimate the posterior expectation of the coefficient of variation $\sigma_0/\mu$. Estimate
the error in your approximation.


7.1.15 In Example 7.1.1, plot the prior and posterior densities on the same graph and
compare them when $n = 30$, $\bar{x} = 0.73$, $\alpha = 3$, and $\beta = 3$. (Hint: Calculate the
logarithm of the posterior density and then exponentiate this. You will need the
log-gamma function defined by $\ln \Gamma(\alpha)$ for $\alpha > 0$.)


PROBLEMS 


7.1.16 Suppose the prior of a real-valued parameter $\theta$ is given by the $N(\theta_0, \tau^2)$
distribution. Show that this distribution does not converge to a probability distribution as
$\tau \to \infty$. (Hint: Consider the limits of the distribution functions.)

7.1.17 Suppose that $(x_1, \ldots, x_n)$ is a sample from $\{f_\theta : \theta \in \Omega\}$ and that we have a
prior $\pi$. Show that if we observe a further sample $(x_{n+1}, \ldots, x_{n+m})$, then the posterior
you obtain from using the posterior $\pi(\cdot \mid x_1, \ldots, x_n)$ as a prior, and then conditioning
on $(x_{n+1}, \ldots, x_{n+m})$, is the same as the posterior obtained using the prior $\pi$ and
conditioning on $(x_1, \ldots, x_n, x_{n+1}, \ldots, x_{n+m})$. This is the Bayesian updating property.

7.1.18 In Example 7.1.1, determine $m(x)$. If you were asked to generate a value from
this distribution, how would you do it? (Hint: For the generation part, use the theorem
of total probability.)




7.1.19 Prove that the posterior distribution depends on the data only through the value 
of a sufficient statistic. 


COMPUTER PROBLEMS 


7.1.20 For the data of Exercise 7.1.7, plot the prior and posterior densities of $\sigma^2$ over
$(0, 10)$ on the same graph and compare them. (Hint: Evaluate the logarithms of the
densities first and then plot the exponential of these values.)

7.1.21 In Example 7.1.4, when $\mu_0 = 0$, $\tau_0^2 = 1$, $\alpha_0 = 2$, $\beta_0 = 1$, $n = 20$, $\bar{x} = 8.2$,
and $s^2 = 2.1$, generate a sample of $10^4$ (or as large as is feasible) from the posterior
distribution of $\sigma^2$ and estimate the posterior probability that $\sigma > 2$. Estimate the error
in your approximation.

7.1.22 In Example 7.1.4, when $\mu_0 = 0$, $\tau_0^2 = 1$, $\alpha_0 = 2$, $\beta_0 = 1$, $n = 20$, $\bar{x} = 8.2$,
and $s^2 = 2.1$, generate a sample of $10^4$ (or as large as is feasible) from the posterior
distribution of $(\mu, \sigma^2)$ and estimate the posterior expectation of $\sigma$. Estimate the error
in your approximation.


DISCUSSION TOPICS 


7.1.23 One of the objections raised concerning Bayesian inference methodology is 
that it is subjective in nature. Comment on this and the role of subjectivity in scientific 
investigations. 


7.1.24 Two statisticians are asked to analyze a data set $x$ produced by a system under
study. Statistician I chooses to use a sampling model $\{f_\theta : \theta \in \Omega\}$ and prior $\pi_I$, while
statistician II chooses to use a sampling model $\{g_\psi : \psi \in \Psi\}$ and prior $\pi_{II}$. Comment
on the fact that these ingredients can be completely different and so the subsequent 
analyses completely different. What is the relevance of this for the role of subjectivity 
in scientific analyses of data? 


7.2 Inferences Based on the Posterior


In Section 7.1, we determined the posterior distribution of $\theta$ as a fundamental object
of Bayesian inference. In essence, the principle of conditional probability asserts that
the posterior distribution $\pi(\theta \mid s)$ contains all the relevant information in the sampling
model $\{f_\theta : \theta \in \Omega\}$, the prior $\pi$, and the data $s$, about the unknown true value of $\theta$.
While this is a major step forward, it does not completely tell us how to make the types
of inferences we discussed in Section 5.5.3.

In particular, we must specify how to compute estimates and credible regions, and how to
carry out hypothesis assessment, which is what we will do in this section. It turns out that
there are often several plausible ways of proceeding, but they all have the common
characteristic that they are based on the posterior.

In general, we are interested in specifying inferences about a real-valued characteristic
of interest $\psi(\theta)$. One of the great advantages of the Bayesian approach is that
inferences about $\psi$ are determined in the same way as inferences about the full
parameter $\theta$, but with the marginal posterior distribution for $\psi$ replacing the full posterior.




This situation can be compared with the likelihood methods of Chapter 6, where it
is not always entirely clear how we should proceed to determine inferences about $\psi$
based upon the likelihood. Still, we have paid a price for this in requiring the addition
of another model ingredient, namely, the prior.

So we need to determine the posterior distribution of $\psi$. This can be a difficult task
in general, even if we have a closed-form expression for $\pi(\theta \mid s)$. When the posterior
distribution of $\theta$ is discrete, the posterior probability function of $\psi$ is given by

$$\omega(\psi_0 \mid s) = \sum_{\{\theta : \psi(\theta) = \psi_0\}} \pi(\theta \mid s).$$

When the posterior distribution of $\theta$ is absolutely continuous, we can often find a
complementing function $\lambda(\theta)$ so that $h(\theta) = (\psi(\theta), \lambda(\theta))$ is 1-1, and such that the
methods of Section 2.9.2 can be applied. Then, denoting the inverse of this transformation
by $\theta = h^{-1}(\psi, \lambda)$, the methods of Section 2.9.2 show that the marginal posterior
distribution of $\psi$ has density given by

$$\omega(\psi_0 \mid s) = \int \pi(h^{-1}(\psi_0, \lambda) \mid s)\, |J(h^{-1}(\psi_0, \lambda))|^{-1} \, d\lambda, \tag{7.2.1}$$

where $J$ denotes the Jacobian derivative of this transformation (see Problem 7.2.35).
Evaluating (7.2.1) can be difficult, and we will generally avoid doing so here. An
example illustrates how we can sometimes avoid directly implementing (7.2.1) and
still obtain the marginal posterior distribution of $\psi$.


EXAMPLE 7.2.1 Location-Scale Normal Model

Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where $\mu \in R^1$
and $\sigma > 0$ are unknown, and we use the prior given in Example 7.1.4. The posterior
distribution for $(\mu, \sigma^2)$ is then given by (7.1.5) and (7.1.6).

Suppose we are primarily interested in $\psi(\mu, \sigma^2) = \sigma^2$. We see immediately that
the marginal posterior of $\sigma^2$ is prescribed by (7.1.6) and thus have no further work to
do, unless we want a form for the marginal posterior density of $\sigma^2$. We can use the
methods of Section 2.6 for this (see Exercise 7.2.4).

If we want the marginal posterior distribution of $\psi(\mu, \sigma^2) = \mu$, then things are not
quite so simple because (7.1.5) only prescribes the conditional posterior distribution
of $\mu$ given $\sigma^2$. We can, however, avoid the necessity to implement (7.2.1). Note that
(7.1.5) implies that

$$Z = \frac{\mu - \mu_x}{(n + 1/\tau_0^2)^{-1/2}\, \sigma} \sim N(0, 1),$$

where $\mu_x$ is given in (7.1.7). Because this distribution does not involve $\sigma^2$, the
posterior distribution of $Z$ is independent of the posterior distribution of $\sigma$. Now if
$X \sim$ Gamma$(\alpha, \beta)$, then $Y = 2\beta X \sim$ Gamma$(\alpha, 1/2) = \chi^2(2\alpha)$ (see Problem 4.6.16 for
the definition of the general chi-squared distribution) and so, from (7.1.6),

$$\frac{2\beta_x}{\sigma^2} \,\Big|\, x_1, \ldots, x_n \sim \chi^2(2\alpha_0 + n),$$

where $\beta_x$ is given in (7.1.8). Therefore (using Problem 4.6.14), as we are dividing an
$N(0, 1)$ variable by the square root of an independent $\chi^2(2\alpha_0 + n)$ random variable
divided by its degrees of freedom, we conclude that the posterior distribution of

$$T = \frac{Z}{\sqrt{\left(\dfrac{2\beta_x}{\sigma^2}\right) \Big/ (2\alpha_0 + n)}} = \frac{\mu - \mu_x}{\sqrt{\dfrac{2\beta_x}{(2\alpha_0 + n)(n + 1/\tau_0^2)}}}$$

is $t(2\alpha_0 + n)$. Equivalently, we can say the posterior distribution of $\mu$ is the same as

$$\mu_x + \sqrt{\frac{1}{2\alpha_0 + n}} \sqrt{\frac{2\beta_x}{n + 1/\tau_0^2}}\; T,$$

where $T \sim t(2\alpha_0 + n)$. By (7.1.9), (7.1.10), and (7.1.11), we have that the posterior
distribution of $\mu$ converges to the distribution of

$$\bar{x} + \sqrt{\frac{n - 1}{2\alpha_0 + n}}\, \frac{s}{\sqrt{n}}\, T$$

as $\tau_0 \to \infty$ and $\beta_0 \to 0$.
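
The location and scale of the marginal $t$ posterior for $\mu$ follow directly from (7.1.7) and (7.1.8). The Python sketch below is illustrative only: the helper names are ours and the numerical inputs are sample values borrowed from Computer Problem 7.1.21. It computes $\mu_x$, the scale multiplying $T$, and the degrees of freedom $2\alpha_0 + n$, checks the diffuse limit numerically by taking $\tau_0^2$ very large and $\beta_0$ very small, and draws $T$ as a standard normal divided by the square root of an independent chi-squared variable over its degrees of freedom.

```python
import math
import random


def marginal_mu_posterior(xbar, s_sq, n, mu0, tau0_sq, alpha0, beta0):
    """Location mu_x, scale, and degrees of freedom of the t posterior for mu."""
    mu_x = (mu0 / tau0_sq + n * xbar) / (n + 1.0 / tau0_sq)
    beta_x = (beta0 + (n - 1) * s_sq / 2.0
              + 0.5 * n * (xbar - mu0) ** 2 / (1.0 + n * tau0_sq))
    df = 2 * alpha0 + n
    scale = math.sqrt(2.0 * beta_x / (df * (n + 1.0 / tau0_sq)))
    return mu_x, scale, df


def draw_t(df, rng):
    """One t(df) draw built as Z / sqrt(V / df) with V ~ chi-squared(df)."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(df / 2.0, 2.0)  # chi-squared(df): shape df/2, scale 2
    return z / math.sqrt(v / df)


loc, scale, df = marginal_mu_posterior(xbar=8.2, s_sq=2.1, n=20, mu0=0.0,
                                       tau0_sq=1.0, alpha0=2.0, beta0=1.0)

# Diffuse limit: the scale should approach sqrt((n - 1) / (2 alpha0 + n)) * s / sqrt(n).
_, diffuse_scale, _ = marginal_mu_posterior(xbar=8.2, s_sq=2.1, n=20, mu0=0.0,
                                            tau0_sq=1e12, alpha0=2.0, beta0=1e-12)
limit_scale = math.sqrt(19 / 24) * math.sqrt(2.1) / math.sqrt(20)

rng = random.Random(1)
mu_draw = loc + scale * draw_t(df, rng)  # one draw from the marginal posterior of mu
print(loc, scale, df)
print(diffuse_scale, limit_scale)
```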


In other cases, we cannot avoid the use of (7.2.1) if we want the marginal posterior
density of $\psi$. For example, suppose we are interested in the posterior distribution of the
coefficient of variation (we exclude the line given by $\mu = 0$ from the parameter space)

$$\psi = \psi(\mu, \sigma^{-2}) = \frac{\sigma}{\mu} = \frac{1}{\mu}\left(\frac{1}{\sigma^2}\right)^{-1/2}.$$

Then a complementing function to $\psi$ is given by

$$\lambda = \lambda(\mu, \sigma^{-2}) = \frac{1}{\sigma^2},$$

and it can be shown (see Section 7.5) that

$$|J(\theta(\psi, \lambda))| = \psi^2 \lambda^{1/2}.$$

If we let $\pi(\cdot \mid \lambda^{-1}, x_1, \ldots, x_n)$ and $\rho(\cdot \mid x_1, \ldots, x_n)$ denote the posterior density of
$\mu$ given $\lambda$, and the posterior density of $\lambda$, respectively, then, from (7.2.1), the marginal
density of $\psi$ is given by

$$\psi^{-2} \int_0^\infty \pi(\psi^{-1} \lambda^{-1/2} \mid \lambda^{-1}, x_1, \ldots, x_n)\, \rho(\lambda \mid x_1, \ldots, x_n)\, \lambda^{-1/2} \, d\lambda. \tag{7.2.2}$$

Without writing this out (see Problem 7.2.22), we note that we are left with a rather
messy integral to evaluate. ■


In some cases, integrals such as (7.2.2) can be evaluated in closed form; in other
cases, they cannot. While it is convenient to have a closed form for a density, often
this is not necessary, as we can use Monte Carlo methods to approximate posterior
probabilities and expectations of interest. We will return to this in Section 7.3. We
should always remember that our goal, in implementing Bayesian inference methods,
is not to find the marginal posterior densities of quantities of interest, but rather to have
a computational algorithm that allows us to implement our inferences.
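
As an illustration of this point, a posterior probability for the coefficient of variation $\psi = \sigma/\mu$ can be approximated by simulation without ever evaluating (7.2.2). The Python sketch below makes several assumptions of our own: the function name, the threshold 0.25, and the numerical inputs (taken from Computer Problem 7.1.21 as sample values) are illustrative, and Gamma$(\alpha, \beta)$ is treated as having rate $\beta$. It draws $(\mu, \sigma^2)$ by composition from (7.1.5) and (7.1.6) and reports the proportion of draws with $\sigma/\mu$ above the threshold, together with the usual binomial standard error.

```python
import math
import random


def estimate_cv_probability(threshold, xbar, s_sq, n, mu0, tau0_sq, alpha0, beta0,
                            size=10000, seed=0):
    """Monte Carlo estimate of the posterior probability that sigma/mu > threshold."""
    rng = random.Random(seed)
    mu_x = (mu0 / tau0_sq + n * xbar) / (n + 1.0 / tau0_sq)
    beta_x = (beta0 + (n - 1) * s_sq / 2.0
              + 0.5 * n * (xbar - mu0) ** 2 / (1.0 + n * tau0_sq))
    hits = 0
    for _ in range(size):
        # 1/sigma^2 ~ Gamma(alpha0 + n/2, rate beta_x); gammavariate expects a scale.
        precision = rng.gammavariate(alpha0 + n / 2.0, 1.0 / beta_x)
        sigma = math.sqrt(1.0 / precision)
        mu = rng.gauss(mu_x, sigma / math.sqrt(n + 1.0 / tau0_sq))
        if sigma / mu > threshold:
            hits += 1
    p_hat = hits / size
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / size)  # error of the approximation
    return p_hat, std_err


p_hat, std_err = estimate_cv_probability(0.25, xbar=8.2, s_sq=2.1, n=20, mu0=0.0,
                                         tau0_sq=1.0, alpha0=2.0, beta0=1.0)
print(p_hat, std_err)
```

Increasing `size` reduces the standard error at the rate $1/\sqrt{\text{size}}$, which is the sense in which a computational algorithm can replace a closed-form density.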

Under fairly weak conditions, it can be shown that the posterior distribution of $\theta$
converges, as the sample size increases, to a distribution degenerate at the true value.
This is very satisfying, as it indicates that Bayesian inference methods are consistent.


7.2.1 Estimation


Suppose now that we want to calculate an estimate of a characteristic of interest $\psi(\theta)$. We base this on the posterior distribution of this quantity. There are several different approaches to this problem.

Perhaps the most natural estimate is to obtain the posterior density (or probability function when relevant) of $\psi$ and use the posterior mode $\hat{\psi}$, i.e., the point where the posterior probability or density function of $\psi$ takes its maximum. In the discrete case, this is the value of $\psi$ with the greatest posterior probability; in the continuous case, it is the value that has the greatest amount of posterior probability in short intervals containing it.

To calculate the posterior mode, we need to maximize $\omega(\psi \mid s)$ as a function of $\psi$. Note that it is equivalent to maximize $m(s)\,\omega(\psi \mid s)$, so we do not need to compute the inverse normalizing constant to implement this. In fact, we can conveniently choose to maximize any function that is a 1-1 increasing function of $\omega(\cdot \mid s)$ and get the same answer. In general, $\omega(\cdot \mid s)$ may not have a unique mode, but typically there is only one.

An alternative estimate is commonly used and has a natural interpretation. This is given by the posterior mean

$$E(\psi(\theta) \mid s),$$

whenever this exists. When the posterior distribution of $\psi$ is symmetrical about its mode, and the expectation exists, then the posterior expectation is the same as the posterior mode; otherwise, these estimates will be different. If we want the estimate to reflect where the central mass of probability lies, then in cases where $\omega(\cdot \mid s)$ is highly skewed, perhaps the mode is a better choice than the mean. We will see in Chapter 8, however, that there are other ways of justifying the posterior mean as an estimate.

We now consider some examples. 


EXAMPLE 7.2.2 Bernoulli Model
Suppose we observe a sample $(x_1, \ldots, x_n)$ from the Bernoulli$(\theta)$ distribution with $\theta \in [0, 1]$ unknown and we place a Beta$(\alpha, \beta)$ prior on $\theta$. In Example 7.1.1, we determined the posterior distribution of $\theta$ to be Beta$(n\bar{x} + \alpha,\ n(1 - \bar{x}) + \beta)$. Let us suppose that the characteristic of interest is $\psi(\theta) = \theta$.

The posterior expectation of $\theta$ is given by

$$
\begin{aligned}
E(\theta \mid x_1, \ldots, x_n)
&= \int_0^1 \theta\, \frac{\Gamma(n+\alpha+\beta)}{\Gamma(n\bar{x}+\alpha)\,\Gamma(n(1-\bar{x})+\beta)}\, \theta^{n\bar{x}+\alpha-1} (1-\theta)^{n(1-\bar{x})+\beta-1}\, d\theta \\
&= \frac{\Gamma(n+\alpha+\beta)}{\Gamma(n\bar{x}+\alpha)\,\Gamma(n(1-\bar{x})+\beta)} \int_0^1 \theta^{n\bar{x}+\alpha} (1-\theta)^{n(1-\bar{x})+\beta-1}\, d\theta \\
&= \frac{\Gamma(n+\alpha+\beta)}{\Gamma(n\bar{x}+\alpha)\,\Gamma(n(1-\bar{x})+\beta)} \cdot \frac{\Gamma(n\bar{x}+\alpha+1)\,\Gamma(n(1-\bar{x})+\beta)}{\Gamma(n+\alpha+\beta+1)} \\
&= \frac{n\bar{x}+\alpha}{n+\alpha+\beta}.
\end{aligned}
$$

When we have a uniform prior, i.e., $\alpha = \beta = 1$, the posterior expectation is given by

$$E(\theta \mid x_1, \ldots, x_n) = \frac{n\bar{x}+1}{n+2}.$$

To determine the posterior mode, we need to maximize

$$\ln\left(\theta^{n\bar{x}+\alpha-1} (1-\theta)^{n(1-\bar{x})+\beta-1}\right) = (n\bar{x}+\alpha-1)\ln\theta + (n(1-\bar{x})+\beta-1)\ln(1-\theta).$$

This function has first derivative

$$\frac{n\bar{x}+\alpha-1}{\theta} - \frac{n(1-\bar{x})+\beta-1}{1-\theta}$$

and second derivative

$$-\frac{n\bar{x}+\alpha-1}{\theta^2} - \frac{n(1-\bar{x})+\beta-1}{(1-\theta)^2}.$$

Setting the first derivative equal to 0 and solving gives the solution

$$\hat{\theta} = \frac{n\bar{x}+\alpha-1}{n+\alpha+\beta-2}.$$

Now, if $\alpha \ge 1$, $\beta \ge 1$, we see that the second derivative is always negative, and so $\hat{\theta}$ is the unique posterior mode. The restriction on the choice of $\alpha \ge 1$, $\beta \ge 1$ implies that the prior has a mode in $(0, 1)$ rather than at 0 or 1. Note that when $\alpha = 1$, $\beta = 1$, namely, when we put a uniform prior on $\theta$, the posterior mode is $\hat{\theta} = \bar{x}$. This is the same as the maximum likelihood estimate (MLE).

The posterior is highly skewed whenever $n\bar{x} + \alpha$ and $n(1 - \bar{x}) + \beta$ are far apart (plot Beta densities to see this). Thus, in such a case, we might consider the posterior mode as a more sensible estimate of $\theta$. Note that when $n$ is large, the mode and the mean will be very close together and in fact very close to the MLE $\bar{x}$. ■
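The two closed-form estimates in Example 7.2.2 are easy to compare numerically. The following is a minimal sketch, not part of the original text: the function name `beta_posterior_estimates` and the data values are hypothetical, and the mode formula is only used under the example's assumption that $\alpha \ge 1$ and $\beta \ge 1$.

```python
def beta_posterior_estimates(n, xbar, alpha, beta):
    """Posterior mean and mode of theta under a Beta(alpha, beta) prior.

    Assumes alpha >= 1 and beta >= 1 (and n + alpha + beta > 2), so the
    closed-form mode from Example 7.2.2 applies.
    """
    a_post = n * xbar + alpha        # first parameter of the Beta posterior
    b_post = n * (1 - xbar) + beta   # second parameter of the Beta posterior
    mean = a_post / (a_post + b_post)
    mode = (a_post - 1) / (a_post + b_post - 2)
    return mean, mode


# Hypothetical data: 3 successes in 10 trials with a uniform prior.
mean, mode = beta_posterior_estimates(n=10, xbar=0.3, alpha=1, beta=1)
print(mean, mode)  # the mean is (3 + 1) / 12; the mode equals the MLE 0.3
```

With a uniform prior the mode reproduces the MLE, while the mean is pulled slightly toward 1/2; increasing `n` with `xbar` fixed brings both estimates close to `xbar`.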


EXAMPLE 7.2.3 Location Normal Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in R^1$ is unknown and $\sigma_0^2$ is known, and we take the prior distribution on $\mu$ to be $N(\mu_0, \tau_0^2)$. Let us suppose that the characteristic of interest is $\psi(\mu) = \mu$.

In Example 7.1.2 we showed that the posterior distribution of $\mu$ is given by the

$$N\left(\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + \frac{n}{\sigma_0^2}\bar{x}\right),\ \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\right)$$

distribution. Because this distribution is symmetric about its mode, and the mean exists, the posterior mode and mean agree and equal

$$\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + \frac{n}{\sigma_0^2}\bar{x}\right).$$

This is a weighted average of the prior mean and the sample mean and lies between these two values.

When $n$ is large, we see that this estimator is approximately equal to the sample mean $\bar{x}$, which we also know to be the MLE for this situation. Furthermore, when we take the prior to be very diffuse, namely, when $\tau_0^2$ is very large, then again this estimator is close to the sample mean.

Also observe that the ratio of the sampling variance of $\bar{x}$ to the posterior variance of $\mu$ is

$$\frac{\sigma_0^2}{n}\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right) = 1 + \frac{\sigma_0^2}{n\tau_0^2},$$

which is always greater than 1. The closer $\tau_0^2$ is to 0, the larger this ratio is. Furthermore, as $\tau_0^2 \to 0$, the Bayesian estimate converges to $\mu_0$.

If we are pretty confident that the population mean $\mu$ is close to the prior mean $\mu_0$, we will take $\tau_0^2$ small so that the bias in the Bayesian estimate will be small and its variance will be much smaller than the sampling variance of $\bar{x}$. In such a situation, the Bayesian estimator improves on accuracy over the sample mean. Of course, if we are not very confident that $\mu$ is close to the prior mean $\mu_0$, then we choose a large value for $\tau_0^2$ and the Bayesian estimator is basically the MLE. ■
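As a rough numerical illustration of Example 7.2.3, the sketch below computes the posterior mean and variance from the precision-weighted formulas and checks the variance ratio $1 + \sigma_0^2/(n\tau_0^2)$. The function name `normal_posterior` and the input values are assumptions made for illustration only.

```python
def normal_posterior(n, xbar, sigma0_sq, mu0, tau0_sq):
    """Posterior mean and variance of mu for the location normal model."""
    precision = 1 / tau0_sq + n / sigma0_sq   # posterior precision
    post_var = 1 / precision
    # Precision-weighted average of the prior mean and the sample mean.
    post_mean = post_var * (mu0 / tau0_sq + n * xbar / sigma0_sq)
    return post_mean, post_var


# Hypothetical inputs: n = 10, xbar = 0.2, sigma0^2 = 1, prior N(0, 2).
n, xbar, sigma0_sq, mu0, tau0_sq = 10, 0.2, 1.0, 0.0, 2.0
post_mean, post_var = normal_posterior(n, xbar, sigma0_sq, mu0, tau0_sq)

# Ratio of the sampling variance of xbar to the posterior variance of mu.
ratio = (sigma0_sq / n) / post_var
print(post_mean, post_var, ratio)
```

The posterior mean falls between the prior mean 0 and the sample mean 0.2, and the ratio matches $1 + \sigma_0^2/(n\tau_0^2) = 1.05$ for these inputs.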


EXAMPLE 7.2.4 Multinomial Model
Suppose we have a sample $(s_1, \ldots, s_n)$ from the model discussed in Example 7.1.3 and we place a Dirichlet$(\alpha_1, \alpha_2, \ldots, \alpha_k)$ distribution on $(\theta_1, \ldots, \theta_{k-1})$. The posterior distribution of $(\theta_1, \ldots, \theta_{k-1})$ is then

$$\text{Dirichlet}(x_1 + \alpha_1, x_2 + \alpha_2, \ldots, x_k + \alpha_k),$$

where $x_i$ is the number of responses in the $i$th category.

Now suppose we are interested in estimating $\psi(\theta) = \theta_1$, the probability that a response is in the first category. It can be shown (see Problem 7.2.25) that, if $(\theta_1, \ldots, \theta_{k-1})$ is distributed Dirichlet$(\alpha_1, \alpha_2, \ldots, \alpha_k)$, then $\theta_i$ is distributed

$$\text{Dirichlet}(\alpha_i, \alpha_{-i}) = \text{Beta}(\alpha_i, \alpha_{-i}),$$

where $\alpha_{-i} = \alpha_1 + \alpha_2 + \cdots + \alpha_k - \alpha_i$. This result implies that the marginal posterior distribution of $\theta_1$ is

$$\text{Beta}(x_1 + \alpha_1,\ x_2 + \cdots + x_k + \alpha_2 + \cdots + \alpha_k).$$

Then, assuming that each $\alpha_i \ge 1$, and using the argument in Example 7.2.2 and $x_1 + \cdots + x_k = n$, the marginal posterior mode of $\theta_1$ is

$$\hat{\theta}_1 = \frac{x_1 + \alpha_1 - 1}{n + \alpha_1 + \cdots + \alpha_k - 2}.$$

When the prior is the uniform, namely, $\alpha_1 = \cdots = \alpha_k = 1$, then

$$\hat{\theta}_1 = \frac{x_1}{n + k - 2}.$$

As in Example 7.2.2, we compute the posterior expectation to be

$$E(\theta_1 \mid x_1, \ldots, x_k) = \frac{x_1 + \alpha_1}{n + \alpha_1 + \cdots + \alpha_k}.$$

The posterior distribution is highly skewed whenever $x_1 + \alpha_1$ and $x_2 + \cdots + x_k + \alpha_2 + \cdots + \alpha_k$ are far apart.

From Problem 7.2.26, we have that the plug-in MLE of $\theta_1$ is $x_1/n$. When $n$ is large, the Bayesian estimates are close to this value, so there is no conflict between the estimates. Notice, however, that when the prior is uniform, then $\alpha_1 + \cdots + \alpha_k = k$, hence the plug-in MLE and the Bayesian estimates will be quite different when $k$ is large relative to $n$. In fact, the posterior mode will always be smaller than the plug-in MLE when $k > 2$ and $x_1 > 0$. This is a situation in which the Bayesian and frequentist approaches to inference differ.

At this point, the decision about which estimate to use is left with the practitioner, as theory does not seem to provide a clear answer. We can be comforted by the fact that the estimates will not differ by much in many contexts of practical importance. ■
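To see how far apart the estimates in Example 7.2.4 can be when $k$ is large relative to $n$, the following sketch evaluates the three formulas side by side. The function name `first_category_estimates` and the counts are hypothetical; the mode formula assumes every $\alpha_i \ge 1$, as in the example.

```python
def first_category_estimates(counts, alphas):
    """Posterior mode, posterior mean, and plug-in MLE of theta_1.

    counts: observed category counts (x_1, ..., x_k)
    alphas: Dirichlet prior parameters, each assumed to be >= 1
    """
    n = sum(counts)
    a_total = sum(alphas)
    mode = (counts[0] + alphas[0] - 1) / (n + a_total - 2)
    mean = (counts[0] + alphas[0]) / (n + a_total)
    mle = counts[0] / n
    return mode, mean, mle


# Hypothetical case: k = 10 categories but only n = 5 observations,
# with a uniform prior (all alphas equal to 1).
counts = [3, 2, 0, 0, 0, 0, 0, 0, 0, 0]
alphas = [1] * 10
mode, mean, mle = first_category_estimates(counts, alphas)
print(mode, mean, mle)  # mode = 3/13, mean = 4/15, MLE = 3/5
```

Here the posterior mode $3/13$ and mean $4/15$ are both well below the plug-in MLE $3/5$, in line with the remark that the mode is smaller than the MLE whenever $k > 2$ and $x_1 > 0$ under a uniform prior.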


EXAMPLE 7.2.5 Location-Scale Normal Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where $\mu \in R^1$ and $\sigma > 0$ are unknown, and we use the prior given in Example 7.1.4. Let us suppose that the characteristic of interest is $\psi(\mu, \sigma^2) = \mu$.

In Example 7.2.1, we derived the marginal posterior distribution of $\mu$ to be the same as the distribution of

$$\mu_x + \sqrt{\frac{1}{2\alpha_0 + n}}\sqrt{\frac{2\beta_x}{n + 1/\tau_0^2}}\; T,$$

where $T \sim t(n + 2\alpha_0)$. This is a $t(n + 2\alpha_0)$ distribution relocated to have its mode at $\mu_x$ and rescaled by the factor

$$\sqrt{\frac{1}{2\alpha_0 + n}}\sqrt{\frac{2\beta_x}{n + 1/\tau_0^2}}.$$

So the marginal posterior mode of $\mu$ is

$$\mu_x = \left(n + \frac{1}{\tau_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + n\bar{x}\right).$$

Because a $t$ distribution is symmetric about its mode, this is also the posterior mean of $\mu$, provided that $n + 2\alpha_0 > 1$, as a $t(\lambda)$ distribution has a mean only when $\lambda > 1$ (see Problem 4.6.16). This will always be the case as the sample size $n \ge 1$. Again, $\mu_x$ is a weighted average of the prior mean $\mu_0$ and the sample average $\bar{x}$.

The marginal posterior mode and expectation can also be obtained for $\psi(\mu, \sigma^2) = \sigma^2$. These computations are left to the reader (see Exercise 7.2.4). ■


One issue that we have not yet addressed is how we will assess the accuracy of Bayesian estimates. Naturally, this is based on the posterior distribution and how concentrated it is about the estimate being used. In the case of the posterior mean, this means that we compute the posterior variance as a measure of spread for the posterior distribution of $\psi$ about its mean. For the posterior mode, we will discuss this issue further in Section 7.2.3.


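EXAMPLE 7.2.6 Posterior Variances
In Example 7.2.2, the posterior variance of $\theta$ is given by (see Exercise 7.2.6)

$$\frac{(n\bar{x} + \alpha)(n(1 - \bar{x}) + \beta)}{(n + \alpha + \beta)^2 (n + \alpha + \beta + 1)}.$$

Notice that the posterior variance converges to 0 as $n \to \infty$.

In Example 7.2.3, the posterior variance is given by $(1/\tau_0^2 + n/\sigma_0^2)^{-1}$. Notice that the posterior variance converges to 0 as $\tau_0^2 \to 0$ and converges to $\sigma_0^2/n$, the sampling variance of $\bar{x}$, as $\tau_0^2 \to \infty$.

In Example 7.2.4, the posterior variance of $\theta_1$ is given by (see Exercise 7.2.7)

$$\frac{(x_1 + \alpha_1)(x_2 + \cdots + x_k + \alpha_2 + \cdots + \alpha_k)}{(n + \alpha_1 + \cdots + \alpha_k)^2 (n + \alpha_1 + \cdots + \alpha_k + 1)}.$$

Notice that the posterior variance converges to 0 as $n \to \infty$.

In Example 7.2.5, the posterior variance of $\mu$ is given by (see Problem 7.2.28)

$$\left(\frac{1}{n + 2\alpha_0}\right)\left(\frac{2\beta_x}{n + 1/\tau_0^2}\right)\left(\frac{n + 2\alpha_0}{n + 2\alpha_0 - 2}\right) = \left(\frac{2\beta_x}{n + 1/\tau_0^2}\right)\left(\frac{1}{n + 2\alpha_0 - 2}\right),$$

provided $n + 2\alpha_0 > 2$, because the variance of a $t(\lambda)$ distribution is $\lambda/(\lambda - 2)$ when $\lambda > 2$ (see Problem 4.6.16). Notice that the posterior variance goes to 0 as $n \to \infty$. ■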
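The claim that the posterior variance in the Bernoulli model shrinks to 0 can be checked quickly. This is only an illustrative sketch: the function name `beta_posterior_variance` is hypothetical, and $\bar{x}$ is held fixed at an assumed value of 0.3 while $n$ grows.

```python
def beta_posterior_variance(n, xbar, alpha, beta):
    """Variance of the Beta(n*xbar + alpha, n*(1 - xbar) + beta) posterior."""
    a = n * xbar + alpha
    b = n * (1 - xbar) + beta
    return a * b / ((a + b) ** 2 * (a + b + 1))


# Hold xbar fixed at 0.3 (an assumed value) and let n grow, uniform prior.
sample_sizes = (10, 100, 1000, 10000)
variances = [beta_posterior_variance(n, 0.3, 1, 1) for n in sample_sizes]
for n, v in zip(sample_sizes, variances):
    print(n, v)
```

The printed variances decrease roughly like $\bar{x}(1 - \bar{x})/n$, which is the behavior the example describes.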


7.2.2 Credible Intervals


A credible interval, for a real-valued parameter $\psi(\theta)$, is an interval $C(s) = [l(s), u(s)]$ that we believe will contain the true value of $\psi$. As with the sampling theory approach, we specify a probability $\gamma$ and then find an interval $C(s)$ satisfying

$$\Pi(\psi(\theta) \in C(s) \mid s) = \Pi(\{\theta : l(s) \le \psi(\theta) \le u(s)\} \mid s) \ge \gamma. \tag{7.2.3}$$

We then refer to $C(s)$ as a $\gamma$-credible interval for $\psi$.




Naturally, we try to find a $\gamma$-credible interval $C(s)$ so that $\Pi(\psi(\theta) \in C(s) \mid s)$ is as close to $\gamma$ as possible, and such that $C(s)$ is as short as possible. This leads to the consideration of highest posterior density (HPD) intervals, which are of the form

$$C(s) = \{\psi : \omega(\psi \mid s) \ge c\},$$

where $\omega(\cdot \mid s)$ is the marginal posterior density of $\psi$ and where $c$ is chosen as large as possible so that (7.2.3) is satisfied. In Figure 7.2.1, we have plotted an example of an HPD interval for a given value of $c$.


[Figure 7.2.1: An HPD interval $C(s) = [l(s), u(s)] = \{\psi : \omega(\psi \mid s) \ge c\}$. Horizontal axis: $\psi$, with $l(s)$ and $u(s)$ marked; vertical axis: $\omega(\psi \mid s)$.]


Clearly, $C(s)$ contains the mode whenever $c \le \max_\psi \omega(\psi \mid s)$. We can take the length of an HPD interval as a measure of the accuracy of the mode of $\omega(\cdot \mid s)$ as an estimator of $\psi(\theta)$. The length of a 0.95-credible interval for $\psi$ will serve the same purpose as the margin of error does with confidence intervals.

Consider now some applications of the concept of credible interval. 


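EXAMPLE 7.2.7 Location Normal Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in R^1$ is unknown and $\sigma_0^2$ is known, and we take the prior distribution on $\mu$ to be $N(\mu_0, \tau_0^2)$. In Example 7.1.2, we showed that the posterior distribution of $\mu$ is given by the

$$N\left(\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + \frac{n}{\sigma_0^2}\bar{x}\right),\ \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\right)$$

distribution. Since this distribution is symmetric about its mode (also mean) $\hat{\mu}$, a shortest $\gamma$-HPD interval is of the form

$$\hat{\mu} \pm \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1/2} c,$$

where $c$ is such that

$$
\begin{aligned}
\gamma &= \Pi\left(\mu \in \left[\hat{\mu} \pm \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1/2} c\right] \,\Big|\, x_1, \ldots, x_n\right) \\
&= \Pi\left(-c \le \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{1/2} (\mu - \hat{\mu}) \le c \,\Big|\, x_1, \ldots, x_n\right).
\end{aligned}
$$

Since

$$\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{1/2} (\mu - \hat{\mu}) \,\Big|\, x_1, \ldots, x_n \sim N(0, 1),$$

we have $\gamma = \Phi(c) - \Phi(-c)$, where $\Phi$ is the standard normal cumulative distribution function (cdf). This immediately implies that $c = z_{(1+\gamma)/2}$ and the $\gamma$-HPD interval is given by

$$\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + \frac{n}{\sigma_0^2}\bar{x}\right) \pm \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1/2} z_{(1+\gamma)/2}.$$

Note that as $\tau_0^2 \to \infty$, namely, as the prior becomes increasingly diffuse, this interval converges to the interval

$$\bar{x} \pm \frac{\sigma_0}{\sqrt{n}}\, z_{(1+\gamma)/2},$$

which is also the $\gamma$-confidence interval derived in Chapter 6 for this problem. So under a diffuse normal prior, the Bayesian and frequentist approaches agree. ■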
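The HPD interval of Example 7.2.7 only needs a standard normal quantile, which Python's standard library provides through `statistics.NormalDist`. The sketch below is an illustration under assumed inputs; the function name `normal_hpd_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def normal_hpd_interval(n, xbar, sigma0_sq, mu0, tau0_sq, gamma=0.95):
    """gamma-HPD interval for mu in the location normal model."""
    precision = 1 / tau0_sq + n / sigma0_sq
    center = (mu0 / tau0_sq + n * xbar / sigma0_sq) / precision
    z = NormalDist().inv_cdf((1 + gamma) / 2)   # the quantile z_{(1+gamma)/2}
    half_width = z / sqrt(precision)
    return center - half_width, center + half_width


# Hypothetical inputs; a huge tau0^2 mimics an increasingly diffuse prior.
informative = normal_hpd_interval(10, 0.2, 1.0, 0.0, 2.0)
diffuse = normal_hpd_interval(10, 0.2, 1.0, 0.0, 1e12)
print(informative)
print(diffuse)  # close to xbar +/- z * sigma0 / sqrt(n)
```

With the very large prior variance, the interval is numerically indistinguishable from the usual $z$-confidence interval, which is the agreement noted at the end of the example.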


EXAMPLE 7.2.8 Location-Scale Normal Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where $\mu \in R^1$ and $\sigma > 0$ are unknown, and we use the prior given in Example 7.1.4. In Example 7.2.1, we derived the marginal posterior distribution of $\mu$ to be the same as

$$\mu_x + \sqrt{\frac{1}{2\alpha_0 + n}}\sqrt{\frac{2\beta_x}{n + 1/\tau_0^2}}\; T,$$

where $T \sim t(2\alpha_0 + n)$. Because this distribution is symmetric about its mode $\mu_x$, a $\gamma$-HPD interval is of the form

$$\mu_x \pm \sqrt{\frac{1}{2\alpha_0 + n}}\sqrt{\frac{2\beta_x}{n + 1/\tau_0^2}}\; c,$$

where $c$ satisfies

$$
\begin{aligned}
\gamma &= \Pi\left(\mu \in \left[\mu_x \pm \sqrt{\frac{1}{2\alpha_0 + n}}\sqrt{\frac{2\beta_x}{n + 1/\tau_0^2}}\; c\right] \,\Big|\, x_1, \ldots, x_n\right) \\
&= \Pi\left(-c \le \left(\frac{1}{2\alpha_0 + n} \cdot \frac{2\beta_x}{n + 1/\tau_0^2}\right)^{-1/2} (\mu - \mu_x) \le c \,\Big|\, x_1, \ldots, x_n\right) \\
&= G_{2\alpha_0 + n}(c) - G_{2\alpha_0 + n}(-c).
\end{aligned}
$$

Here, $G_{2\alpha_0 + n}$ is the $t(2\alpha_0 + n)$ cdf, and therefore $c = t_{(1+\gamma)/2}(2\alpha_0 + n)$.

Using (7.1.9), (7.1.10), and (7.1.11) we have that this interval converges to the interval

$$\bar{x} \pm \sqrt{\frac{n - 1}{2\alpha_0 + n}}\, \frac{s}{\sqrt{n}}\, t_{(1+\gamma)/2}(n + 2\alpha_0)$$

as $\tau_0 \to \infty$ and $\beta_0 \to 0$. Note that this is a little different from the $\gamma$-confidence interval we obtained for $\mu$ in Example 6.3.8, but when $\alpha_0/n$ is small, they are virtually identical. ■


In the examples we have considered so far, we could obtain closed-form expres- 
sions for the HPD intervals. In general, this is not the case. In such situations, we have 
to resort to numerical methods to obtain the HPD intervals, but we do not pursue this 
topic further here. 

There are other methods of deriving credible intervals. For example, a common method of obtaining a $\gamma$-credible interval for $\psi$ is to take the interval $[\psi_l, \psi_r]$, where $\psi_l$ is a $(1 - \gamma)/2$ quantile for the posterior distribution of $\psi$ and $\psi_r$ is a $1 - (1 - \gamma)/2$ quantile for this distribution. Alternatively, we could form one-sided intervals. These credible intervals avoid the more extensive computations that may be needed for HPD intervals.
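When no closed form is available, both kinds of interval can be approximated on a grid. The sketch below is one simple way to do this, not a method taken from the text: it discretizes a unimodal posterior on $(0, 1)$, builds the HPD interval by collecting grid cells in order of decreasing density, and builds the equal-tailed interval from the cumulative sums. The function names, the grid size `m`, and the choice of a Beta(2, 1) posterior (picked because its answers can be checked by hand) are all assumptions.

```python
def grid_posterior(a, b, m=20000):
    """Discretize a Beta(a, b) density on m equal cells of (0, 1)."""
    thetas = [(i + 0.5) / m for i in range(m)]            # cell midpoints
    dens = [t ** (a - 1) * (1 - t) ** (b - 1) for t in thetas]
    total = sum(dens)
    probs = [d / total for d in dens]                     # cell probabilities
    return thetas, probs


def hpd_interval(thetas, probs, gamma):
    """Collect cells by decreasing density until mass >= gamma (unimodal case)."""
    order = sorted(range(len(thetas)), key=lambda i: probs[i], reverse=True)
    mass, kept = 0.0, []
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= gamma:
            break
    return thetas[min(kept)], thetas[max(kept)]


def equal_tailed_interval(thetas, probs, gamma):
    """Interval between the (1-gamma)/2 and 1-(1-gamma)/2 posterior quantiles."""
    lo_target = (1 - gamma) / 2
    hi_target = 1 - lo_target
    cum, lo, hi = 0.0, None, None
    for t, p in zip(thetas, probs):
        cum += p
        if lo is None and cum >= lo_target:
            lo = t
        if cum >= hi_target:
            hi = t
            break
    return lo, hi


thetas, probs = grid_posterior(2, 1)      # density 2*theta on (0, 1)
hpd = hpd_interval(thetas, probs, 0.95)
et = equal_tailed_interval(thetas, probs, 0.95)
print(hpd, et)
```

For this skewed posterior the HPD interval is approximately $[\sqrt{0.05}, 1]$, which is shorter than the equal-tailed interval $[\sqrt{0.025}, \sqrt{0.975}]$; for a symmetric posterior the two would coincide.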


7.2.3 Hypothesis Testing and Bayes Factors


Suppose now that we want to assess the evidence in the observed data concerning the hypothesis $H_0 : \psi(\theta) = \psi_0$. It seems clear how we should assess this, namely, compute the posterior probability

$$\Pi(\psi(\theta) = \psi_0 \mid s). \tag{7.2.4}$$

If this is small, then conclude that we have evidence against $H_0$. We will see further justification for this approach in Chapter 8.


EXAMPLE 7.2.9
Suppose we want to assess the evidence concerning whether or not $\theta \in A$. If we let $\psi = I_A$, then we are assessing the hypothesis $H_0 : \psi(\theta) = 1$ and

$$\Pi(\psi(\theta) = 1 \mid s) = \Pi(A \mid s).$$

So in this case, we simply compute the posterior probability that $\theta \in A$. ■




There can be a problem, however, with using (7.2.4) to assess a hypothesis. For when the prior distribution of $\psi$ is absolutely continuous, then $\Pi(\psi(\theta) = \psi_0 \mid s) = 0$ for all data $s$. Therefore, we would always find evidence against $H_0$ no matter what is observed, which does not make sense. In general, if the value $\psi_0$ is assigned small prior probability, then it can happen that this value also has a small posterior probability no matter what data are observed.

To avoid this problem, there is an alternative approach to hypothesis assessment that is sometimes used. Recall that, if $\psi_0$ is a surprising value for the posterior distribution of $\psi$, then this is evidence that $H_0$ is false. The value $\psi_0$ is surprising whenever it occurs in a region of low probability for the posterior distribution of $\psi$. A region of low probability will correspond to a region where the posterior density $\omega(\cdot \mid s)$ is relatively low. So, one possible method for assessing this is by computing the (Bayesian) P-value

$$\Pi(\{\theta : \omega(\psi(\theta) \mid s) \le \omega(\psi_0 \mid s)\} \mid s). \tag{7.2.5}$$

Note that when $\omega(\cdot \mid s)$ is unimodal, (7.2.5) corresponds to computing a tail probability. If the probability (7.2.5) is small, then $\psi_0$ is surprising, at least with respect to our posterior beliefs. When we decide to reject $H_0$ whenever the P-value is less than $1 - \gamma$, then this approach is equivalent to computing a $\gamma$-HPD region for $\psi$ and rejecting $H_0$ whenever $\psi_0$ is not in the region.


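EXAMPLE 7.2.10 (Example 7.2.9 continued)
Applying the P-value approach to this problem, we see that $\psi(\theta) = I_A(\theta)$ has posterior given by the Bernoulli$(\Pi(A \mid s))$ distribution. Therefore, $\omega(\cdot \mid s)$ is defined by $\omega(0 \mid s) = 1 - \Pi(A \mid s) = \Pi(A^c \mid s)$ and $\omega(1 \mid s) = \Pi(A \mid s)$.

Now $\psi_0 = 1$, so

$$
\{\theta : \omega(\psi(\theta) \mid s) \le \omega(1 \mid s)\} = \{\theta : \omega(I_A(\theta) \mid s) \le \Pi(A \mid s)\} =
\begin{cases}
\Omega & \text{if } \Pi(A \mid s) \ge \Pi(A^c \mid s), \\
A & \text{if } \Pi(A \mid s) < \Pi(A^c \mid s).
\end{cases}
$$

Therefore, (7.2.5) becomes

$$
\Pi(\{\theta : \omega(\psi(\theta) \mid s) \le \omega(1 \mid s)\} \mid s) =
\begin{cases}
1 & \text{if } \Pi(A \mid s) \ge \Pi(A^c \mid s), \\
\Pi(A \mid s) & \text{if } \Pi(A \mid s) < \Pi(A^c \mid s),
\end{cases}
$$

so again we have evidence against $H_0$ whenever $\Pi(A \mid s)$ is small. ■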


We see from Examples 7.2.9 and 7.2.10 that computing the P-value (7.2.5) is essentially equivalent to using (7.2.4), whenever the marginal parameter $\psi$ takes only two values. This is not the case whenever $\psi$ takes more than two values, however, and the statistician has to decide which method is more appropriate in such a context.

As previously noted, when the prior distribution of $\psi$ is absolutely continuous, then (7.2.4) is always 0, no matter what data are observed. As the following example illustrates, there is also a difficulty with using (7.2.5) in such a situation.


EXAMPLE 7.2.11
Suppose that the posterior distribution of $\theta$ is Beta$(2, 1)$, i.e., $\omega(\theta \mid s) = 2\theta$ when $0 \le \theta \le 1$, and we want to assess $H_0 : \theta = 3/4$. Then $\omega(\theta \mid s) \le \omega(3/4 \mid s)$ if and only if $\theta \le 3/4$, and (7.2.5) is given by

$$\int_0^{3/4} 2\theta \, d\theta = 9/16.$$

On the other hand, suppose we make a 1-1 transformation to $\rho = \theta^2$ so that the hypothesis is now $H_0 : \rho = 9/16$. The posterior distribution of $\rho$ is Beta$(1, 1)$. Since the posterior density of $\rho$ is constant, this implies that the posterior density at every possible value is less than or equal to the posterior density evaluated at 9/16. Therefore, (7.2.5) equals 1, and we would never find evidence against $H_0$ using this parameterization.

This example shows that our assessment of $H_0$ via (7.2.5) depends on the parameterization used, which does not seem appropriate. ■
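The dependence on parameterization in Example 7.2.11 can be reproduced with a small numerical integration. The sketch below approximates (7.2.5) for a density on $(0, 1)$ with the midpoint rule; the function name `bayesian_p_value` and the grid size are assumptions introduced for illustration.

```python
def bayesian_p_value(density, value0, m=100000):
    """Approximate (7.2.5) for a posterior density on (0, 1).

    Sums density * cell width over all cells whose density does not
    exceed the density at the hypothesized value (midpoint rule).
    """
    h = 1.0 / m
    reference = density(value0)
    total = 0.0
    for i in range(m):
        x = (i + 0.5) * h
        d = density(x)
        if d <= reference:
            total += d * h
    return total


# theta parameterization: posterior density 2*theta, H0: theta = 3/4.
p_theta = bayesian_p_value(lambda t: 2 * t, 0.75)

# rho = theta^2 parameterization: posterior is uniform, H0: rho = 9/16.
p_rho = bayesian_p_value(lambda r: 1.0, 9 / 16)

print(p_theta, p_rho)  # approximately 9/16 and 1
```

The same hypothesis thus receives a P-value of about 0.5625 in one parameterization and 1 in the other.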


The difficulty in using (7.2.5), as demonstrated in Example 7.2.11, only occurs with continuous posterior distributions. So, to avoid this problem, it is often recommended that the hypothesis to be tested always be assigned a positive prior probability. As demonstrated in Example 7.2.10, the approach via (7.2.5) is then essentially equivalent to using (7.2.4) to assess $H_0$.

In problems where it seems natural to use continuous priors, this is accomplished by taking the prior $\Pi$ to be a mixture of probability distributions, as discussed in Section 2.5.4, namely, the prior distribution equals

$$\Pi = p\,\Pi_1 + (1 - p)\,\Pi_2,$$

where $\Pi_1(\psi(\theta) = \psi_0) = 1$ and $\Pi_2(\psi(\theta) = \psi_0) = 0$, i.e., $\Pi_1$ is degenerate at $\psi_0$ and $\Pi_2$ is continuous at $\psi_0$. Then

$$\Pi(\psi(\theta) = \psi_0) = p\,\Pi_1(\psi(\theta) = \psi_0) + (1 - p)\,\Pi_2(\psi(\theta) = \psi_0) = p > 0$$

is the prior probability that $H_0$ is true.
The prior predictive for the data $s$ is then given by

$$m(s) = p\,m_1(s) + (1 - p)\,m_2(s),$$

where $m_i$ is the prior predictive obtained via prior $\Pi_i$ (see Problem 7.2.34). This implies (see Problem 7.2.34) that the posterior probability measure for $\theta$, when using the prior $\Pi$, is

$$\Pi(A \mid s) = \frac{p\,m_1(s)}{p\,m_1(s) + (1 - p)\,m_2(s)}\,\Pi_1(A \mid s) + \frac{(1 - p)\,m_2(s)}{p\,m_1(s) + (1 - p)\,m_2(s)}\,\Pi_2(A \mid s), \tag{7.2.6}$$

where $\Pi_i(\cdot \mid s)$ is the posterior measure obtained via the prior $\Pi_i$. Note that this is a mixture of the posterior probability measures $\Pi_1(\cdot \mid s)$ and $\Pi_2(\cdot \mid s)$ with mixture probabilities

$$\frac{p\,m_1(s)}{p\,m_1(s) + (1 - p)\,m_2(s)} \quad \text{and} \quad \frac{(1 - p)\,m_2(s)}{p\,m_1(s) + (1 - p)\,m_2(s)}.$$

Now $\Pi_1(\cdot \mid s)$ is degenerate at $\psi_0$ (if the prior is degenerate at a point then the posterior must be degenerate at that point too) and $\Pi_2(\cdot \mid s)$ is continuous at $\psi_0$. Therefore,

$$\Pi(\psi(\theta) = \psi_0 \mid s) = \frac{p\,m_1(s)}{p\,m_1(s) + (1 - p)\,m_2(s)}, \tag{7.2.7}$$

and we use this probability to assess $H_0$.

The following example illustrates this approach.


EXAMPLE 7.2.12 Location Normal Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in R^1$ is unknown and $\sigma_0^2$ is known, and we want to assess the hypothesis $H_0 : \mu = \mu_0$. As in Example 7.1.2, we will take the prior for $\mu$ to be an $N(\mu_0, \tau_0^2)$ distribution. Given that we are assessing whether or not $\mu = \mu_0$, it seems reasonable to place the mode of the prior at the hypothesized value. The choice of the hyperparameter $\tau_0^2$ then reflects the degree of our prior belief that $H_0$ is true. We let $\Pi_2$ denote this prior probability measure, i.e., $\Pi_2$ is the $N(\mu_0, \tau_0^2)$ probability measure.

If we use $\Pi_2$ as our prior, then, as shown in Example 7.1.2, the posterior distribution of $\mu$ is absolutely continuous. This implies that (7.2.4) is 0. So, following the preceding discussion, we consider instead the prior $\Pi = p\,\Pi_1 + (1 - p)\,\Pi_2$ obtained by mixing $\Pi_2$ with a probability measure $\Pi_1$ degenerate at $\mu_0$. Then $\Pi_1(\{\mu_0\}) = 1$ and so $\Pi(\{\mu_0\}) = p$. As shown in Example 7.1.2, under $\Pi_2$ the posterior distribution of $\mu$ is

$$N\left(\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + \frac{n}{\sigma_0^2}\bar{x}\right),\ \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\right),$$

while the posterior under $\Pi_1$ is the distribution degenerate at $\mu_0$. We now need to evaluate (7.2.7), and we will do this in Example 7.2.13. ■

Bayes Factors 


Bayes factors comprise another method of hypothesis assessment and are defined in 
terms of odds. 


Definition 7.2.1 In a probability model with sample space $S$ and probability measure $P$, the odds in favor of event $A \subset S$ is defined to be $P(A)/P(A^c)$, namely, the ratio of the probability of $A$ to the probability of $A^c$.


Obviously, large values of the odds in favor of A indicate a strong belief that A is true. 
Odds represent another way of presenting probabilities that are convenient in certain 
contexts, e.g., horse racing. Bayes factors compare posterior odds with prior odds. 


Definition 7.2.2 The Bayes factor $BF_{H_0}$ in favor of the hypothesis $H_0 : \psi(\theta) = \psi_0$ is defined, whenever the prior probability of $H_0$ is not 0 or 1, to be the ratio of the posterior odds in favor of $H_0$ to the prior odds in favor of $H_0$, or

$$BF_{H_0} = \left[\frac{\Pi(\psi(\theta) = \psi_0 \mid s)}{1 - \Pi(\psi(\theta) = \psi_0 \mid s)}\right] \Big/ \left[\frac{\Pi(\psi(\theta) = \psi_0)}{1 - \Pi(\psi(\theta) = \psi_0)}\right]. \tag{7.2.8}$$




So the Bayes factor in favor of $H_0$ is measuring the degree to which the data have changed the odds in favor of the hypothesis. If $BF_{H_0}$ is small, then the data are providing evidence against $H_0$; if $BF_{H_0}$ is large, the data are providing evidence in favor of $H_0$.

There is a relationship between the posterior probability of $H_0$ being true and $BF_{H_0}$. From (7.2.8), we obtain

$$\Pi(\psi(\theta) = \psi_0 \mid s) = \frac{r\,BF_{H_0}}{1 + r\,BF_{H_0}}, \tag{7.2.9}$$

where

$$r = \frac{\Pi(\psi(\theta) = \psi_0)}{1 - \Pi(\psi(\theta) = \psi_0)}$$

is the prior odds in favor of $H_0$. So, when $BF_{H_0}$ is small, then $\Pi(\psi(\theta) = \psi_0 \mid s)$ is small and conversely.

One reason for using Bayes factors to assess hypotheses is the following result. 
This establishes a connection with likelihood ratios. 


Theorem 7.2.1 If the prior $\Pi$ is a mixture $\Pi = p\,\Pi_1 + (1 - p)\,\Pi_2$, where $\Pi_1(A) = 1$, $\Pi_2(A^c) = 1$, and we want to assess the hypothesis $H_0 : \theta \in A$, then

$$BF_{H_0} = m_1(s)/m_2(s),$$

where $m_i$ is the prior predictive of the data under $\Pi_i$.


PROOF Recall that, if a prior concentrates all of its probability on a set, then the posterior concentrates all of its probability on this set, too. Then using (7.2.6), we have

$$BF_{H_0} = \frac{\Pi(A \mid s)}{1 - \Pi(A \mid s)} \Big/ \frac{\Pi(A)}{1 - \Pi(A)} = \frac{p\,m_1(s)}{(1 - p)\,m_2(s)} \Big/ \frac{p}{1 - p} = \frac{m_1(s)}{m_2(s)}.$$

Interestingly, Theorem 7.2.1 indicates that the Bayes factor is independent of $p$. We note, however, that it is not immediately clear how to interpret the value of $BF_{H_0}$. In particular, how large does $BF_{H_0}$ have to be to provide strong evidence in favor of $H_0$? One approach to this problem is to use (7.2.9), as this gives the posterior probability of $H_0$, which is directly interpretable. So we can calibrate the Bayes factor. Note, however, that this requires the specification of $p$.


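EXAMPLE 7.2.13 Location Normal Model (Example 7.2.12 continued)
We now compute the prior predictive under $\Pi_2$. We have that the joint density of $(x_1, \ldots, x_n)$ given $\mu$ equals

$$(2\pi\sigma_0^2)^{-n/2} \exp\left(-\frac{n - 1}{2\sigma_0^2} s^2\right) \exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu)^2\right)$$

and so

$$
\begin{aligned}
m_2(x_1, \ldots, x_n)
&= \int_{-\infty}^{\infty} (2\pi\sigma_0^2)^{-n/2} \exp\left(-\frac{n - 1}{2\sigma_0^2} s^2\right) \exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu)^2\right) (2\pi\tau_0^2)^{-1/2} \exp\left(-\frac{1}{2\tau_0^2}(\mu - \mu_0)^2\right) d\mu \\
&= (2\pi\sigma_0^2)^{-n/2} \exp\left(-\frac{n - 1}{2\sigma_0^2} s^2\right) \times \tau_0^{-1} (2\pi)^{-1/2} \int_{-\infty}^{\infty} \exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu)^2\right) \exp\left(-\frac{1}{2\tau_0^2}(\mu - \mu_0)^2\right) d\mu.
\end{aligned}
$$

Then using (7.1.2), we have

$$
\begin{aligned}
&\tau_0^{-1} (2\pi)^{-1/2} \int_{-\infty}^{\infty} \exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu)^2\right) \exp\left(-\frac{1}{2\tau_0^2}(\mu - \mu_0)^2\right) d\mu \\
&\quad = \tau_0^{-1} \exp\left(\frac{1}{2}\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + \frac{n}{\sigma_0^2}\bar{x}\right)^2\right) \exp\left(-\frac{1}{2}\left(\frac{\mu_0^2}{\tau_0^2} + \frac{n\bar{x}^2}{\sigma_0^2}\right)\right) \left(\frac{n}{\sigma_0^2} + \frac{1}{\tau_0^2}\right)^{-1/2}.
\end{aligned}
\tag{7.2.10}
$$

Therefore,

$$
\begin{aligned}
m_2(x_1, \ldots, x_n)
&= (2\pi\sigma_0^2)^{-n/2} \exp\left(-\frac{n - 1}{2\sigma_0^2} s^2\right) \tau_0^{-1} \exp\left(\frac{1}{2}\left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2} + \frac{n}{\sigma_0^2}\bar{x}\right)^2\right) \\
&\quad \times \exp\left(-\frac{1}{2}\left(\frac{\mu_0^2}{\tau_0^2} + \frac{n\bar{x}^2}{\sigma_0^2}\right)\right) \left(\frac{n}{\sigma_0^2} + \frac{1}{\tau_0^2}\right)^{-1/2}.
\end{aligned}
$$

Because $\Pi_1$ is degenerate at $\mu_0$, it is immediate that the prior predictive under $\Pi_1$ is given by

$$m_1(x_1, \ldots, x_n) = (2\pi\sigma_0^2)^{-n/2} \exp\left(-\frac{n - 1}{2\sigma_0^2} s^2\right) \exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu_0)^2\right).$$

Therefore, $BF_{H_0}$ equals

$$\exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu_0)^2\right)$$

divided by (7.2.10).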



For example, suppose that $\mu_0 = 0$, $\tau_0^2 = 2$, $\sigma_0^2 = 1$, $n = 10$, and $\bar{x} = 0.2$. Then

$$\exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu_0)^2\right) = \exp\left(-\frac{10}{2}(0.2)^2\right) = 0.81873,$$

while (7.2.10) equals

$$\frac{1}{\sqrt{2}} \exp\left(\frac{1}{2}\left(\frac{1}{2} + 10\right)^{-1} (10\,(0.2))^2\right) \exp\left(-\frac{10\,(0.2)^2}{2}\right) \left(10 + \frac{1}{2}\right)^{-1/2} = 0.21615.$$

So

$$BF_{H_0} = \frac{0.81873}{0.21615} = 3.7878,$$

which gives some evidence in favor of $H_0 : \mu = \mu_0$. If we suppose that $p = 1/2$, so that we are completely indifferent between $H_0$ being true and not being true, then $r = 1$, and (7.2.9) gives

$$\Pi(\mu = \mu_0 \mid x_1, \ldots, x_n) = \frac{3.7878}{1 + 3.7878} = 0.79114,$$

indicating a large degree of support for $H_0$. ■
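The arithmetic in Example 7.2.13 can be reproduced directly from the ratio of the $\Pi_1$ term to (7.2.10), followed by the calibration step (7.2.9). The sketch below is illustrative only; the function names `bayes_factor_location_normal` and `posterior_prob_h0` are hypothetical.

```python
from math import exp, sqrt


def bayes_factor_location_normal(n, xbar, sigma0_sq, mu0, tau0_sq):
    """Bayes factor in favor of H0: mu = mu0, as m1 / m2 with common factors cancelled."""
    precision = 1 / tau0_sq + n / sigma0_sq
    # Factor of m1 that does not cancel against m2.
    numerator = exp(-n * (xbar - mu0) ** 2 / (2 * sigma0_sq))
    # Right-hand side of (7.2.10).
    denominator = (
        (1 / sqrt(tau0_sq))
        * exp(0.5 * (mu0 / tau0_sq + n * xbar / sigma0_sq) ** 2 / precision)
        * exp(-0.5 * (mu0 ** 2 / tau0_sq + n * xbar ** 2 / sigma0_sq))
        / sqrt(precision)
    )
    return numerator / denominator


def posterior_prob_h0(bf, p):
    """Posterior probability of H0 from (7.2.9), with prior probability p."""
    r = p / (1 - p)   # prior odds in favor of H0
    return r * bf / (1 + r * bf)


bf = bayes_factor_location_normal(n=10, xbar=0.2, sigma0_sq=1.0, mu0=0.0, tau0_sq=2.0)
print(bf, posterior_prob_h0(bf, p=0.5))
```

Trying other values of `p` with the same `bf` shows how the calibration in (7.2.9) depends on the prior probability of $H_0$, even though the Bayes factor itself does not.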


7.2.4 Prediction


Prediction problems arise when we have an unobserved response value $t$ in a sample space $\mathcal{T}$ and observed response $s \in S$. Furthermore, we have the statistical model $\{P_\theta : \theta \in \Omega\}$ for $s$ and the conditional statistical model $\{Q_\theta(\cdot \mid s) : \theta \in \Omega\}$ for $t$ given $s$. We assume that both models have the same true value of $\theta \in \Omega$. The objective is to construct a prediction $\hat{t}(s) \in \mathcal{T}$, of the unobserved value $t$, based on the observed data $s$. The value of $t$ could be unknown simply because it represents a future outcome.

If we denote the conditional density or probability function (whichever is relevant) of $t$ by $q_\theta(\cdot \mid s)$, the joint distribution of $(\theta, s, t)$ is given by

$$q_\theta(t \mid s)\, f_\theta(s)\, \pi(\theta).$$

Then, once we have observed $s$ (assume here that the distributions of $\theta$ and $t$ are absolutely continuous; if not, we replace integrals by sums), the conditional density of $(t, \theta)$, given $s$, is

$$\frac{q_\theta(t \mid s)\, f_\theta(s)\, \pi(\theta)}{\int_\Omega \int_{\mathcal{T}} q_\theta(t \mid s)\, f_\theta(s)\, \pi(\theta)\, dt\, d\theta} = \frac{q_\theta(t \mid s)\, f_\theta(s)\, \pi(\theta)}{\int_\Omega f_\theta(s)\, \pi(\theta)\, d\theta} = \frac{q_\theta(t \mid s)\, f_\theta(s)\, \pi(\theta)}{m(s)}.$$

Then the marginal posterior distribution of $t$, known as the posterior predictive of $t$, is

$$q(t \mid s) = \int_\Omega \frac{q_\theta(t \mid s)\, f_\theta(s)\, \pi(\theta)}{m(s)}\, d\theta = \int_\Omega q_\theta(t \mid s)\, \pi(\theta \mid s)\, d\theta.$$

Notice that the posterior predictive of $t$ is obtained by averaging the conditional density of $t$, given $(\theta, s)$, with respect to the posterior distribution of $\theta$.

Now that we have obtained the posterior predictive distribution of $t$, we can use it to select an estimate of the unobserved value. Again, we could choose the posterior mode $\hat{t}$ or the posterior expectation $E(t \mid s) = \int_{\mathcal{T}} t\, q(t \mid s)\, dt$ as our prediction, whichever is deemed most relevant.


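EXAMPLE 7.2.14 Bernoulli Model
Suppose we want to predict the next independent outcome $X_{n+1}$, having observed a sample $(x_1, \ldots, x_n)$ from the Bernoulli$(\theta)$ and $\theta \sim$ Beta$(\alpha, \beta)$. Here, the future observation is independent of the observed data. The posterior predictive probability function of $X_{n+1}$ at $t$ is then given by

$$
\begin{aligned}
q(t \mid x_1, \ldots, x_n)
&= \int_0^1 \theta^t (1 - \theta)^{1 - t}\, \frac{\Gamma(n + \alpha + \beta)}{\Gamma(n\bar{x} + \alpha)\, \Gamma(n(1 - \bar{x}) + \beta)}\, \theta^{n\bar{x} + \alpha - 1} (1 - \theta)^{n(1 - \bar{x}) + \beta - 1}\, d\theta \\
&= \frac{\Gamma(n + \alpha + \beta)}{\Gamma(n\bar{x} + \alpha)\, \Gamma(n(1 - \bar{x}) + \beta)} \int_0^1 \theta^{n\bar{x} + \alpha + t - 1} (1 - \theta)^{n(1 - \bar{x}) + \beta + (1 - t) - 1}\, d\theta \\
&= \frac{\Gamma(n + \alpha + \beta)}{\Gamma(n\bar{x} + \alpha)\, \Gamma(n(1 - \bar{x}) + \beta)} \cdot \frac{\Gamma(n\bar{x} + \alpha + t)\, \Gamma(n(1 - \bar{x}) + \beta + 1 - t)}{\Gamma(n + \alpha + \beta + 1)} \\
&= \begin{cases}
\dfrac{n\bar{x} + \alpha}{n + \alpha + \beta} & t = 1, \\[2ex]
\dfrac{n(1 - \bar{x}) + \beta}{n + \alpha + \beta} & t = 0,
\end{cases}
\end{aligned}
$$

which is the probability function of a Bernoulli$((n\bar{x} + \alpha)/(n + \alpha + \beta))$ distribution.

Using the posterior mode as the predictor, i.e., maximizing $q(t \mid x_1, \ldots, x_n)$ for $t$, leads to the prediction

$$
\hat{t} = \begin{cases}
1 & \text{if } \dfrac{n\bar{x} + \alpha}{n + \alpha + \beta} \ge \dfrac{n(1 - \bar{x}) + \beta}{n + \alpha + \beta}, \\[2ex]
0 & \text{otherwise.}
\end{cases}
$$

The posterior expectation predictor is given by

$$E(t \mid x_1, \ldots, x_n) = \frac{n\bar{x} + \alpha}{n + \alpha + \beta}.$$

Note that the posterior mode takes a value in $\{0, 1\}$, and the future $X_{n+1}$ will be in this set, too. The posterior mean can be any value in $[0, 1]$. ■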
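The description of the posterior predictive as an average over the posterior of $\theta$ suggests a simple Monte Carlo check of Example 7.2.14. In the sketch below, which is an illustration rather than a method from the text, the predictive probability of a 1 is estimated by averaging draws of $\theta$ from the Beta posterior, using `random.betavariate` from the standard library. The function name, the number of draws, the seed, and the data values are assumptions.

```python
import random


def predictive_prob_one_mc(n, xbar, alpha, beta, draws=100_000, seed=0):
    """Monte Carlo estimate of q(1 | x_1, ..., x_n) for the Bernoulli model.

    For a Bernoulli outcome, q_theta(1) = theta, so the posterior predictive
    probability of a 1 is the posterior mean of theta.
    """
    rng = random.Random(seed)
    a_post = n * xbar + alpha
    b_post = n * (1 - xbar) + beta
    total = 0.0
    for _ in range(draws):
        total += rng.betavariate(a_post, b_post)   # one posterior draw of theta
    return total / draws


# Hypothetical data: 3 successes in 10 trials, uniform prior.
exact = (10 * 0.3 + 1) / (10 + 1 + 1)
approx = predictive_prob_one_mc(n=10, xbar=0.3, alpha=1, beta=1)
print(exact, approx)
```

The simulated value should agree with the closed form $(n\bar{x} + \alpha)/(n + \alpha + \beta)$ up to Monte Carlo error; the same averaging idea carries over to models where no closed form exists.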


EXAMPLE 7.2.15 Location Normal Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in R^1$ is unknown and $\sigma_0^2$ is known, and we use the prior given in Example 7.1.2. Suppose we want to predict a future observation $X_{n+1}$, but this time $X_{n+1}$ is from the

$$N\left(\bar{x},\ \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1} \sigma_0^2\right) \tag{7.2.11}$$

distribution. So, in this case, the future observation is not independent of the observed data, but it is independent of the parameter. A simple calculation (see Exercise 7.2.9) shows that (7.2.11) is the posterior predictive distribution of $t$ and so we would predict $t$ by $\bar{x}$, as this is both the posterior mode and mean. ■


We can also construct a $\gamma$-prediction region $C(s)$ for a future value $t$ from the model $\{q_\theta(\cdot \mid s) : \theta \in \Omega\}$. A $\gamma$-prediction region for $t$ satisfies $Q(C(s) \mid s) \ge \gamma$, where $Q(\cdot \mid s)$ is the posterior predictive measure for $t$. One approach to constructing $C(s)$ is to apply the HPD concept to $q(t \mid s)$. We illustrate this via several examples.


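EXAMPLE 7.2.16 Bernoulli Model (Example 7.2.14 continued)
Suppose we want a $\gamma$-prediction region for a future value $X_{n+1}$. In Example 7.2.14, we derived the posterior predictive distribution of $X_{n+1}$ to be

$$\text{Bernoulli}\left(\frac{n\bar{x} + \alpha}{n + \alpha + \beta}\right).$$

Accordingly, a $\gamma$-prediction region for $t$, derived via the HPD concept, is given by

$$
C(x_1, \ldots, x_n) = \begin{cases}
\{0, 1\} & \text{if } \max\left\{\dfrac{n\bar{x} + \alpha}{n + \alpha + \beta}, \dfrac{n(1 - \bar{x}) + \beta}{n + \alpha + \beta}\right\} < \gamma, \\[2ex]
\{1\} & \text{if } \gamma \le \max\left\{\dfrac{n\bar{x} + \alpha}{n + \alpha + \beta}, \dfrac{n(1 - \bar{x}) + \beta}{n + \alpha + \beta}\right\} = \dfrac{n\bar{x} + \alpha}{n + \alpha + \beta}, \\[2ex]
\{0\} & \text{if } \gamma \le \max\left\{\dfrac{n\bar{x} + \alpha}{n + \alpha + \beta}, \dfrac{n(1 - \bar{x}) + \beta}{n + \alpha + \beta}\right\} = \dfrac{n(1 - \bar{x}) + \beta}{n + \alpha + \beta}.
\end{cases}
$$

We see that this predictive region contains just the mode or encompasses all possible values for $X_{n+1}$. In the latter case, this is not an informative inference. ■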
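The three cases of the prediction region in Example 7.2.16 translate directly into a short function. This is a sketch with a hypothetical function name and made-up inputs; when the two predictive probabilities tie, it returns $\{1\}$, matching the convention used for the mode predictor in Example 7.2.14.

```python
def bernoulli_prediction_region(n, xbar, alpha, beta, gamma):
    """HPD-style gamma-prediction region for the next Bernoulli outcome."""
    p1 = (n * xbar + alpha) / (n + alpha + beta)   # predictive P(X_{n+1} = 1)
    p0 = 1 - p1                                    # predictive P(X_{n+1} = 0)
    if max(p0, p1) < gamma:
        return {0, 1}        # the mode alone does not carry enough probability
    return {1} if p1 >= p0 else {0}


# Hypothetical inputs with a uniform prior.
print(bernoulli_prediction_region(10, 0.3, 1, 1, 0.95))  # both outcomes needed
print(bernoulli_prediction_region(10, 0.3, 1, 1, 0.60))  # mode 0 suffices
print(bernoulli_prediction_region(10, 0.9, 1, 1, 0.80))  # mode 1 suffices
```

Only when one outcome has predictive probability at least $\gamma$ does the region say anything beyond "0 or 1", which is the point made at the end of the example.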


EXAMPLE 7.2.17 Location Normal Model (Example 7.2.15 continued)
Suppose we want a $\gamma$-prediction interval for a future observation $X_{n+1}$ from a

$$N\left(\bar{x},\ \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1} \sigma_0^2\right)$$

distribution. As this is also the posterior predictive distribution of $X_{n+1}$ and is symmetric about $\bar{x}$, a $\gamma$-prediction interval for $X_{n+1}$, derived via the HPD concept, is given by

$$\bar{x} \pm \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1/2} \sigma_0\, z_{(1+\gamma)/2}.$$

■
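For completeness, here is a small sketch of the prediction interval in Example 7.2.17, taking the predictive distribution exactly as stated in Example 7.2.15. The function name `normal_prediction_interval` and the inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def normal_prediction_interval(n, xbar, sigma0_sq, tau0_sq, gamma=0.95):
    """gamma-prediction interval for X_{n+1}, centered at xbar."""
    precision = 1 / tau0_sq + n / sigma0_sq
    # Standard deviation of the stated predictive distribution (7.2.11).
    pred_sd = sqrt(sigma0_sq / precision)
    z = NormalDist().inv_cdf((1 + gamma) / 2)
    return xbar - z * pred_sd, xbar + z * pred_sd


# Hypothetical inputs: n = 10, xbar = 0.2, sigma0^2 = 1, tau0^2 = 2.
lo, hi = normal_prediction_interval(10, 0.2, 1.0, 2.0)
print(lo, hi)
```

Because the predictive distribution is symmetric about $\bar{x}$, the interval is centered at the point predictor, and its half-width is the predictive standard deviation times $z_{(1+\gamma)/2}$.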


Summary of Section 7.2 


• Based on the posterior distribution of a parameter, we can obtain estimates of the parameter (posterior modes or means), construct credible intervals for the parameter (HPD intervals), and assess hypotheses about the parameter (posterior probability of the hypothesis, Bayesian P-values, Bayes factors).

• A new type of inference was discussed in this section, namely, prediction problems where we are concerned with predicting an unobserved value from a sampling model.


EXERCISES 


7.2.1 For the model discussed in Example 7.1.1, derive the posterior mean of $\psi = \theta^m$ where $m > 0$.

7.2.2 For the model discussed in Example 7.1.2, determine the posterior distribution of the third quartile $\psi = \mu + \sigma_0 z_{0.75}$. Determine the posterior mode and the posterior expectation of $\psi$.

7.2.3 In Example 7.2.1, determine the posterior expectation and mode of $1/\sigma^2$.

7.2.4 In Example 7.2.1, determine the posterior expectation and mode of $\sigma^2$. (Hint: You will need the posterior density of $\sigma^2$ to determine the mode.)

7.2.5 Carry out the calculations to verify the posterior mode and posterior expectation of $\theta_1$ in Example 7.2.4.

7.2.6 Establish that the variance of the $\theta$ in Example 7.2.2 is as given in Example 7.2.6. Prove that this goes to 0 as $n \to \infty$.

7.2.7 Establish that the variance of $\theta_1$ in Example 7.2.4 is as given in Example 7.2.6. Prove that this goes to 0 as $n \to \infty$.

7.2.8 In Example 7.2.14, which of the two predictors derived there do you find more sensible? Why?

7.2.9 In Example 7.2.15, prove that the posterior predictive distribution for $X_{n+1}$ is as stated. (Hint: Write the posterior predictive distribution density as an expectation.)

7.2.10 Suppose that $(x_1, \ldots, x_n)$ is a sample from the Exponential$(\lambda)$ distribution, where $\lambda > 0$ is unknown and $\lambda \sim$ Gamma$(\alpha_0, \beta_0)$. Determine the mode of the posterior distribution of $\lambda$. Also determine the posterior expectation and posterior variance of $\lambda$.

7.2.11 Suppose that $(x_1, \ldots, x_n)$ is a sample from the Exponential$(\lambda)$ distribution where $\lambda > 0$ is unknown and $\lambda \sim$ Gamma$(\alpha_0, \beta_0)$. Determine the mode of the posterior distribution of a future independent observation $X_{n+1}$. Also determine the posterior expectation of $X_{n+1}$ and posterior variance of $X_{n+1}$. (Hint: Problems 3.2.16 and 3.3.20.)

7.2.12 Suppose that in a population of students in a course with a large enrollment, the
mark, out of 100, on a final exam is approximately distributed $N(\mu, 9)$. The instructor
places the prior $\mu \sim N(65, 1)$ on the unknown parameter. A sample of 10 marks is
obtained as given below.

46 68 34 86 75 56 77 73 53 64

(a) Determine the posterior mode and a 0.95-credible interval for $\mu$. What does this
interval tell you about the accuracy of the estimate?

(b) Use the 0.95-credible interval for $\mu$ to test the hypothesis $H_0 : \mu = 65$.

(c) Suppose we assign prior probability 0.5 to $\mu = 65$. Using the mixture prior
$\Pi = 0.5\Pi_1 + 0.5\Pi_2$, where $\Pi_1$ is degenerate at $\mu = 65$ and $\Pi_2$ is the $N(65, 1)$ distribution,
compute the posterior probability of the null hypothesis.

(d) Compute the Bayes factor in favor of $H_0 : \mu = 65$ when using the mixture prior.

7.2.13 A manufacturer believes that a machine produces rods with lengths in centimeters
distributed $N(\mu_0, \sigma^2)$, where $\mu_0$ is known and $\sigma^2 > 0$ is unknown, and that the
prior distribution $1/\sigma^2 \sim \text{Gamma}(\alpha_0, \beta_0)$ is appropriate.

(a) Determine the posterior distribution of $\sigma^2$ based on a sample $(x_1, \ldots, x_n)$.

(b) Determine the posterior mean of $\sigma^2$.

(c) Indicate how you would assess the hypothesis $H_0 : \sigma^2 \le \sigma_0^2$.


7.2.14 Consider the sampling model and prior in Exercise 7.1.1. 

(a) Suppose we want to estimate $\theta$ based upon having observed $s = 1$. Determine the
posterior mode and posterior mean. Which would you prefer in this situation? Explain
why.

(b) Determine a 0.8 HPD region for $\theta$ based on having observed $s = 1$.

(c) Suppose instead interest was in $\psi(\theta) = I_{\{1,2\}}(\theta)$. Identify the prior distribution of
$\psi$. Identify the posterior distribution of $\psi$ based on having observed $s = 1$. Determine
a 0.5 HPD region for $\psi$.

7.2.15 For an event $A$, we have that $P(A^c) = 1 - P(A)$.

(a) What is the relationship between the odds in favor of $A$ and the odds in favor of $A^c$?

(b) When $A$ is a subset of the parameter space, what is the relationship between the
Bayes factor in favor of $A$ and the Bayes factor in favor of $A^c$?

7.2.16 Suppose you are told that the odds in favor of a subset $A$ are 3 to 1. What is the
probability of $A$? If the Bayes factor in favor of $A$ is 10 and the prior probability of $A$
is 1/2, then determine the posterior probability of $A$.

7.2.17 Suppose data $s$ is obtained. Two statisticians analyze these data using the same
sampling model but different priors, and they are asked to assess a hypothesis $H_0$. Both
statisticians report a Bayes factor in favor of $H_0$ equal to 100. Statistician I assigned
prior probability 1/2 to $H_0$, whereas statistician II assigned prior probability 1/4 to $H_0$.
Which statistician has the greatest posterior degree of belief in $H_0$ being true?

7.2.18 You are told that a 0.95-credible interval, determined using the HPD criterion,
for a quantity $\psi(\theta)$ is given by $(-3.3, 2.6)$. If you are asked to assess the hypothesis
$H_0 : \psi(\theta) = 0$, then what can you say about the Bayesian P-value? Explain your
answer.

7.2.19 What is the range of possible values for a Bayes factor in favor of $A \subset \Omega$?
Under what conditions will a Bayes factor in favor of $A \subset \Omega$ take its smallest value?


PROBLEMS 


7.2.20 Suppose that $(x_1, \ldots, x_n)$ is a sample from the Uniform$[0, \theta]$ distribution, where
$\theta > 0$ is unknown, and we have $\theta \sim \text{Gamma}(\alpha_0, \beta_0)$. Determine the mode of the
posterior distribution of $\theta$. (Hint: The posterior is not differentiable at $\theta = x_{(n)}$.)

7.2.21 Suppose that $(x_1, \ldots, x_n)$ is a sample from the Uniform$[0, \theta]$ distribution, where
$\theta \in (0, 1)$ is unknown, and we have $\theta \sim \text{Uniform}[0, 1]$. Determine the form of the
$\gamma$-credible interval for $\theta$ based on the HPD concept.

7.2.22 In Example 7.2.1, write out the integral given in (7.2.2). 


7.2.23 (MV) In Example 7.2.1, write out the integral that you would need to evaluate
if you wanted to compute the posterior density of the third quartile of the population
distribution, i.e., $\psi = \mu + \sigma z_{0.75}$.

7.2.24 Consider the location normal model discussed in Example 7.1.2 and the
population coefficient of variation $\psi = \sigma_0/\mu$.

(a) Show that the posterior expectation of $\psi$ does not exist. (Hint: Show that we can
write the posterior expectation as

$$\int_{-\infty}^{\infty} \frac{1}{a + bz}\,\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z^2}{2}\right) dz,$$

where $b > 0$, and show that this integral does not exist by considering the behavior of
the integrand at $z = -a/b$.)

(b) Determine the posterior density of $\psi$.

(c) Show that you can determine the posterior mode of $\psi$ by evaluating the posterior
density at two specific points. (Hint: Proceed by maximizing the logarithm of the
posterior density using the methods of calculus.)

7.2.25 (MV) Suppose that $(\theta_1, \ldots, \theta_{k-1}) \sim \text{Dirichlet}(\alpha_1, \alpha_2, \ldots, \alpha_k)$.

(a) Prove that $(\theta_1, \ldots, \theta_{k-2}) \sim \text{Dirichlet}(\alpha_1, \alpha_2, \ldots, \alpha_{k-2}, \alpha_{k-1} + \alpha_k)$. (Hint: In the
integral to integrate out $\theta_{k-1}$, make the transformation
$\theta_{k-1} \to \theta_{k-1}/(1 - \theta_1 - \cdots - \theta_{k-2})$.)

(b) Prove that $\theta_1 \sim \text{Beta}(\alpha_1, \alpha_2 + \cdots + \alpha_k)$. (Hint: Use part (a).)

(c) Suppose $(i_1, \ldots, i_k)$ is a permutation of $(1, \ldots, k)$. Prove that
$(\theta_{i_1}, \ldots, \theta_{i_{k-1}}) \sim \text{Dirichlet}(\alpha_{i_1}, \alpha_{i_2}, \ldots, \alpha_{i_k})$. (Hint: What is the Jacobian of this transformation?)

(d) Prove that $\theta_i \sim \text{Beta}(\alpha_i, \alpha_{-i})$. (Hint: Use parts (b) and (c).)

7.2.26 (MV) In Example 7.2.4, show that the plug-in MLE of $\theta_1$ is given by $x_1/n$, i.e.,
find the MLE of $(\theta_1, \ldots, \theta_k)$ and determine the first coordinate. (Hint: Show there is
a unique solution to the score equations and then use the facts that the log-likelihood is
bounded above and goes to $-\infty$ whenever $\theta_i \to 0$.)

7.2.27 Compare the results obtained in Exercises 7.2.3 and 7.2.4. What do you con- 
clude about the invariance properties of these estimation procedures? (Hint: Consider 
Theorem 6.2.1.) 

7.2.28 In Example 7.2.5, establish that the posterior variance of $\mu$ is as stated in
Example 7.2.6. (Hint: Problem 4.6.16.)

7.2.29 In a prediction problem, as described in Section 7.2.4, derive the form of the
prior predictive density for $t$ when the joint density of $(\theta, s, t)$ is $q_\theta(t \mid s) f_\theta(s) \pi(\theta)$
(assume $s$ and $\theta$ are real-valued).

7.2.30 In Example 7.2.16, derive the posterior predictive probability function of
$(X_{n+1}, X_{n+2})$, having observed $x_1, \ldots, x_n$, when $X_1, \ldots, X_n, X_{n+1}, X_{n+2}$ are
independently and identically distributed (i.i.d.) Bernoulli($\theta$).

7.2.31 In Example 7.2.15, derive the posterior predictive distribution for $X_{n+1}$, having
observed $x_1, \ldots, x_n$, when $X_1, \ldots, X_n, X_{n+1}$ are i.i.d. $N(\mu, \sigma_0^2)$. (Hint: We can write
$X_{n+1} = \mu + \sigma_0 Z$, where $Z \sim N(0, 1)$ is independent of the posterior distribution of
$\mu$.)


7.2.32 For the context of Example 7.2.1, prove that the posterior predictive distribution
of an additional future observation $X_{n+1}$ from the population distribution has the same
distribution as

$$\mu_x + \sqrt{\frac{2\beta_x\left(\left(n + 1/\tau_0^2\right)^{-1} + 1\right)}{2\alpha_0 + n}}\; T,$$

where $T \sim t(2\alpha_0 + n)$. (Hint: Note that we can write $X_{n+1} = \mu + \sigma U$, where
$U \sim N(0, 1)$ is independent of $X_1, \ldots, X_n, \mu, \sigma$, and then reason as in Example 7.2.1.)

7.2.33 In Example 7.2.1, determine the form of an exact $\gamma$-prediction interval for an
additional future observation $X_{n+1}$ from the population distribution, based on the HPD
concept. (Hint: Use Problem 7.2.32.)

7.2.34 Suppose that $\Pi_1$ and $\Pi_2$ are discrete probability distributions on the parameter
space $\Omega$. Prove that when the prior $\Pi$ is a mixture $\Pi = p\Pi_1 + (1 - p)\Pi_2$, then the
prior predictive for the data $s$ is given by $m(s) = p\,m_1(s) + (1 - p)\,m_2(s)$, and the
posterior probability measure is given by (7.2.6).

7.2.35 (MV) Suppose that $\theta = (\theta_1, \theta_2) \in R^2$ and $h(\theta_1, \theta_2) = (\psi(\theta), \lambda(\theta)) \in R^2$.
Assume that $h$ satisfies the necessary conditions and establish (7.2.1). (Hint: Theorem
2.9.2.)


CHALLENGES 


7.2.36 Another way to assess the null hypothesis $H_0 : \psi(\theta) = \psi_0$ is to compute the
P-value

$$\Pi\left( \frac{\omega(\psi(\theta) \mid s)}{\omega(\psi(\theta))} \le \frac{\omega(\psi_0 \mid s)}{\omega(\psi_0)} \,\Big|\, s \right), \tag{7.2.12}$$

where $\omega$ is the marginal prior density or probability function of $\psi$. We call (7.2.12) the
observed relative surprise of $H_0$.

The quantity $\omega(\psi_0 \mid s)/\omega(\psi_0)$ is a measure of how the data $s$ have changed our a
priori belief that $\psi_0$ is the true value of $\psi$. When (7.2.12) is small, $\psi_0$ is a surprising
value for $\psi$, as this indicates that the data have increased our belief more for other
values of $\psi$.

(a) Prove that (7.2.12) is invariant under 1-1 continuously differentiable
transformations of $\psi$.

(b) Show that a value $\psi_0$ that makes (7.2.12) smallest, maximizes $\omega(\psi_0 \mid s)/\omega(\psi_0)$.
We call such a value a least relative surprise estimate of $\psi$.

(c) Indicate how to use (7.2.12) to form a $\gamma$-credible region, known as a $\gamma$-relative
surprise region, for $\psi$.

(d) Suppose that $\psi$ is real-valued with prior density $\omega$ and posterior density $\omega(\cdot \mid s)$
both continuous and positive at $\psi_0$. Let $A_\epsilon = (\psi_0 - \epsilon, \psi_0 + \epsilon)$. Show that
$BF_{A_\epsilon} \to \omega(\psi_0 \mid s)/\omega(\psi_0)$ as $\epsilon \downarrow 0$. Generalize this to the case where $\psi$ takes its values in an
open subset of $R^k$. This shows that we can think of the observed relative surprise as a
way of calibrating Bayes factors.




7.3 | Bayesian Computations 


In virtually all the examples in this chapter so far, we have been able to work out the
exact form of the posterior distributions and carry out a number of important
computations using these. It often occurs, however, that we cannot derive any convenient
form for the posterior distribution. Furthermore, even when we can derive the posterior
distribution, computations might arise that cannot be carried out exactly (e.g.,
recall the discussion in Example 7.2.1 that led to the integral (7.2.2)). These calculations
involve evaluating complicated sums or integrals. Therefore, when we apply Bayesian
inference in a practical example, we need to have available methods for approximating
these quantities.

The subject of approximating integrals is an extensive topic that we cannot deal
with fully here.¹ We will, however, introduce several approximation methods that arise
very naturally in Bayesian inference problems.


7.3.1 | Asymptotic Normality of the Posterior 


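In many circumstances, it turns out that the posterior distribution of $\theta \in R^1$ is
approximately normally distributed. We can then use this to compute approximate credible
regions for the true value of $\theta$, carry out hypothesis assessment, etc. One such
result says that, under conditions that we will not describe here, when $(x_1, \ldots, x_n)$ is a
sample from $f_\theta$, then

$$\Pi\left( \frac{\theta - \hat\theta(x_1, \ldots, x_n)}{\hat\sigma(x_1, \ldots, x_n)} \le z \,\Big|\, x_1, \ldots, x_n \right) \to \Phi(z)$$

as $n \to \infty$, where $\hat\theta(x_1, \ldots, x_n)$ is the posterior mode, and

$$\hat\sigma^2(x_1, \ldots, x_n) = \left( -\frac{\partial^2 \ln\left(L(\theta \mid x_1, \ldots, x_n)\,\pi(\theta)\right)}{\partial\theta^2}\bigg|_{\theta = \hat\theta} \right)^{-1}.$$

Note that this result is similar to Theorem 6.5.3 for the MLE. Actually, we can replace
$\hat\theta(x_1, \ldots, x_n)$ by the MLE and replace $\hat\sigma^{-2}(x_1, \ldots, x_n)$ by the observed information
(see Section 6.5), and the result still holds. When $\theta$ is $k$-dimensional, there is a similar
but more complicated result.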


7.3.2 | Sampling from the Posterior 


Typically, there are many things we want to compute as part of implementing a Bayesian
analysis. Many of these can be written as expectations with respect to the posterior
distribution of $\theta$. For example, we might want to compute the posterior probability content
of a subset $A \subset \Omega$, namely,

$$\Pi(A \mid s) = E(I_A(\theta) \mid s).$$

More generally, we want to be able to compute the posterior expectation of some
arbitrary function $w(\theta)$, namely,

$$E(w(\theta) \mid s). \tag{7.3.1}$$

¹ See, for example, Approximating Integrals via Monte Carlo and Deterministic Methods, by M. Evans
and T. Swartz (Oxford University Press, Oxford, 2000).


It would certainly be convenient if we could compute all these quantities exactly,
but quite often we cannot. In fact, it is not really necessary that we evaluate (7.3.1)
exactly. This is because we naturally expect any inference we make about the true
value of the parameter to be subject to sampling error (different data sets of the same
size lead to different inferences). It is not necessary to carry out our computations to a
much higher degree of precision than what sampling error contributes. For example, if
the sampling error allows us to know the value of a parameter only to within $\pm 0.1$
units, then there is no point in computing an estimate to many more digits of accuracy.

In light of this, many of the computational problems associated with implementing
Bayesian inference are effectively solved if we can sample from the posterior for $\theta$.
For when this is possible, we simply generate an i.i.d. sequence $\theta_1, \theta_2, \ldots, \theta_N$ from
the posterior distribution of $\theta$ and estimate (7.3.1) by

$$\bar w = \frac{1}{N} \sum_{i=1}^{N} w(\theta_i).$$

We know then, from the strong law of large numbers (see Theorem 4.3.2), that
$\bar w \to E(w(\theta) \mid s)$ almost surely as $N \to \infty$.

Of course, for any given $N$, the value of $\bar w$ only approximates (7.3.1); we would like
to know that we have chosen $N$ large enough so that the approximation is appropriately
accurate. When $E(w^2(\theta) \mid s) < \infty$, then the central limit theorem (see Theorem 4.4.3)
tells us that

$$\frac{\bar w - E(w(\theta) \mid s)}{\sigma_w / \sqrt{N}} \xrightarrow{D} N(0, 1)$$

as $N \to \infty$, where $\sigma_w^2 = \text{Var}(w(\theta) \mid s)$. In general, we do not know the value of $\sigma_w^2$,
but we can estimate it by

$$s_w^2 = \frac{1}{N - 1} \sum_{i=1}^{N} \left(w(\theta_i) - \bar w\right)^2$$

when $w(\theta)$ is a quantitative variable, and by $s_w^2 = \bar w(1 - \bar w)$ when $w = I_A$ for $A \subset \Omega$.
As shown in Section 4.4.2, in either case, $s_w^2$ is a consistent estimate of $\sigma_w^2$. Then, by
Corollary 4.4.4, we have that

$$\frac{\bar w - E(w(\theta) \mid s)}{s_w / \sqrt{N}} \xrightarrow{D} N(0, 1)$$

as $N \to \infty$.

From this result we know that

$$\bar w \pm 3 \frac{s_w}{\sqrt{N}}$$

is an approximate 100% confidence interval for $E(w(\theta) \mid s)$, so we can look at $3 s_w/\sqrt{N}$
to determine whether or not $N$ is large enough for the accuracy required.

One caution concerning this approach to assessing error is that $3 s_w/\sqrt{N}$ is itself
subject to error, as $s_w$ is an estimate of $\sigma_w$, so this could be misleading. A common
recommendation then is to monitor the value of $3 s_w/\sqrt{N}$ for successively larger values
of $N$ and stop the sampling only when it is clear that the value of $3 s_w/\sqrt{N}$ is small
enough for the accuracy desired and appears to be declining appropriately. Even this
approach, however, will not give a guaranteed bound on the accuracy of the
computations, so it is necessary to be cautious.

It is also important to remember that application of these results requires that
$\sigma_w^2 < \infty$. For a bounded $w$, this is always true, as any bounded random variable always has
a finite variance. For an unbounded $w$, however, this must be checked, and sometimes
this is very difficult to do.
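
As a rough illustration of the estimate and its error assessment, the following Python sketch computes the sample mean, the standard error $s_w/\sqrt{N}$, and hence the half-width $3 s_w/\sqrt{N}$ from i.i.d. posterior draws. The Beta(3, 5) posterior and the function names `mc_estimate` and `posterior_mean_beta` are assumptions made only for this illustration; they do not come from the text.

```python
import math
import random


def mc_estimate(values):
    """Return (mean, standard error) for i.i.d. draws w(theta_1), ..., w(theta_N)."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance s_w^2 with divisor N - 1 (quantitative w).
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(s2 / n)


def posterior_mean_beta(a, b, n_draws, seed=0):
    """Estimate E(theta | s) when the posterior is assumed to be Beta(a, b)."""
    rng = random.Random(seed)
    draws = [rng.betavariate(a, b) for _ in range(n_draws)]
    return mc_estimate(draws)


est, se = posterior_mean_beta(3.0, 5.0, 10_000)
print(f"estimate {est:.4f}, half-width 3*se = {3 * se:.4f}")
```

Here the exact posterior mean is 3/8, so the printed interval can be compared with a known answer. Rerunning with successively larger `n_draws` mimics the monitoring recommendation above.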

We consider an example where it is possible to exactly sample from the posterior. 


EXAMPLE 7.3.1 Location-Scale Normal
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where $\mu \in R^1$
and $\sigma > 0$ are unknown, and we use the prior given in Example 7.1.4. The posterior
distribution for $(\mu, \sigma^2)$ developed there is

$$\mu \mid \sigma^2, x_1, \ldots, x_n \sim N\left(\mu_x, \left(n + 1/\tau_0^2\right)^{-1} \sigma^2\right) \tag{7.3.2}$$

and

$$1/\sigma^2 \mid x_1, \ldots, x_n \sim \text{Gamma}(\alpha_0 + n/2, \beta_x), \tag{7.3.3}$$

where $\mu_x$ is given by (7.1.7) and $\beta_x$ is given by (7.1.8).

Most statistical packages have built-in generators for gamma distributions and for
the normal distribution. Accordingly, it is very easy to generate a sample
$(\mu_1, \sigma_1^2), \ldots, (\mu_N, \sigma_N^2)$ from this posterior. We simply generate a value for $1/\sigma_i^2$ from the specified
gamma distribution; then, given this value, we generate the value of $\mu_i$ from the
specified normal distribution.

Suppose, then, that we want to derive the posterior distribution of the coefficient
of variation $\psi = \sigma/\mu$. To do this we generate $N$ values from the joint posterior of
$(\mu, \sigma^2)$, using (7.3.2) and (7.3.3), and compute $\psi$ for each of these. We then know
immediately that $\psi_1, \ldots, \psi_N$ is a sample from the posterior distribution of $\psi$.

As a specific numerical example, suppose that we observed the following sample
$(x_1, \ldots, x_{15})$.

11.6714 1.8957 2.1228 2.1286 1.0751
8.1631 1.8236 4.0362 6.8513 7.6461
1.9020 7.4899 4.9233 8.3223 7.9486

Here, $\bar x = 5.2$ and $s = 3.3$. Suppose further that the prior is specified by $\mu_0 = 4$,
$\tau_0^2 = 2$, $\alpha_0 = 2$, and $\beta_0 = 1$.

From (7.1.7), we have

$$\mu_x = \left(15 + \frac{1}{2}\right)^{-1} \left(\frac{4}{2} + 15 \cdot 5.2\right) = 5.161,$$

and from (7.1.8),

$$\beta_x = 1 + \frac{15}{2}(5.2)^2 + \frac{4^2}{2 \cdot 2} + \frac{14}{2}(3.3)^2 - \frac{1}{2}\left(15 + \frac{1}{2}\right)^{-1}\left(\frac{4}{2} + 15 \cdot 5.2\right)^2 = 77.578.$$

Therefore, we generate

$$1/\sigma^2 \mid x_1, \ldots, x_n \sim \text{Gamma}(9.5, 77.578),$$

followed by

$$\mu \mid \sigma^2, x_1, \ldots, x_n \sim N\left(5.161, (15.5)^{-1} \sigma^2\right).$$

See Appendix B for some code that can be used to generate from this joint distribution.
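
Independently of the Appendix B code, the two-step generation just described can be sketched in Python using only the standard library. The function names `sample_posterior_psi` and `prob_psi_at_most` are hypothetical, and the sketch assumes the second argument of Gamma(9.5, 77.578) is a rate, so the scale passed to `random.gammavariate` is its reciprocal.

```python
import math
import random


def sample_posterior_psi(n_draws, seed=0):
    """Draw psi = sigma/mu from the joint posterior of Example 7.3.1."""
    rng = random.Random(seed)
    shape, rate = 9.5, 77.578   # 1/sigma^2 | x ~ Gamma(9.5, 77.578)
    mu_x, prec = 5.161, 15.5    # mu | sigma^2, x ~ N(5.161, sigma^2 / 15.5)
    psi = []
    for _ in range(n_draws):
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        inv_sigma2 = rng.gammavariate(shape, 1.0 / rate)
        sigma2 = 1.0 / inv_sigma2
        # Given sigma^2, generate mu from the conditional normal.
        mu = rng.gauss(mu_x, math.sqrt(sigma2 / prec))
        psi.append(math.sqrt(sigma2) / mu)
    return psi


def prob_psi_at_most(psi, cutoff=0.5):
    """Estimate the posterior probability that psi <= cutoff, with standard error."""
    n = len(psi)
    p = sum(1 for v in psi if v <= cutoff) / n
    # For an indicator function, s_w^2 = p(1 - p).
    return p, math.sqrt(p * (1.0 - p) / n)


psi_draws = sample_posterior_psi(10_000)
p_hat, se = prob_psi_at_most(psi_draws)
print(f"estimate {p_hat:.3f}, standard error {se:.4f}")
```

Because the draws are pseudorandom, the printed estimate varies with the seed and will only approximately agree with values reported elsewhere in the example.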

In Figure 7.3.1, we have plotted a sample of $N = 200$ values of $(\mu, \sigma^2)$ from this
joint posterior. In Figure 7.3.2, we have plotted a density histogram of the 200 values
of $\psi$ that arise from this sample.

Figure 7.3.1: A sample of 200 values of $(\mu, \sigma^2)$ from the joint posterior in Example 7.3.1
when $n = 15$, $\bar x = 5.2$, $s = 3.3$, $\mu_0 = 4$, $\tau_0^2 = 2$, $\alpha_0 = 2$, and $\beta_0 = 1$.

Figure 7.3.2: A density histogram of 200 values from the posterior distribution of $\psi$ in
Example 7.3.1.


A sample of 200 is not very large, so we next generated a sample of $N = 10^3$
values from the posterior distribution of $\psi$. A density histogram of these values is
provided in Figure 7.3.3. In Figure 7.3.4, we have provided a density histogram based on
a sample of $N = 10^4$ values. We can see from this that at $N = 10^3$, the basic shape of
the distribution has been obtained, although the right tail is not being very accurately
estimated. Things look better in the right tail for $N = 10^4$, but note there are still some
extreme values quite disconnected from the main mass of values. As is characteristic
of most distributions, we will need very large values of $N$ to accurately estimate the
tails. In any case, we have learned that this distribution is skewed to the right with a
long right tail.

Figure 7.3.3: A density histogram of 1000 values from the posterior distribution of $\psi$ in
Example 7.3.1.

Figure 7.3.4: A density histogram of $N = 10^4$ values from the posterior distribution of $\psi$ in
Example 7.3.1.

Suppose we want to estimate

$$\Pi(\psi \le 0.5 \mid x_1, \ldots, x_n) = E\left(I_{(-\infty, 0.5]}(\psi) \mid x_1, \ldots, x_n\right).$$

Now $w = I_{(-\infty, 0.5]}$ is bounded, so its posterior variance exists. In the following table,
we have recorded the estimates for each $N$ together with the standard error based on
each of the generated samples. We have included some code for computing these
estimates and their standard errors in Appendix B. Based on the results from $N = 10^4$,
it would appear that this posterior probability is in the interval $0.289 \pm 3(0.0045) =
[0.2755, 0.3025]$.


Estimate of $\Pi(\psi \le 0.5 \mid x_1, \ldots, x_n)$ | Standard Error


This example also demonstrates an important point. It would be very easy for us to
calculate the sample mean of the values of $\psi$ generated from its posterior distribution
and then consider this as an estimate of the posterior mean of $\psi$. But Problem 7.2.24
suggests (see Problem 7.3.15) that this mean will not exist. Accordingly, a Monte Carlo
estimate of this quantity does not make any sense! So we must always check first that
any expectation we want to estimate exists, before we proceed with some estimation
procedure. ■


When we cannot sample directly from the posterior, the methods of the
following section are needed.




7.3.3 | Sampling from the Posterior Via Gibbs Sampling (Advanced) 


Sampling from the posterior, as described in Section 7.3.2, is very effective when it
can be implemented. Unfortunately, it is often difficult or even impossible to do this
directly, as we did in Example 7.3.1. There are, however, a number of algorithms that
allow us to approximately sample from the posterior. One of these, known as Gibbs
sampling, is applicable in many statistical contexts.

To describe this algorithm, suppose we want to generate samples from the joint
distribution of $(Y_1, \ldots, Y_k) \in R^k$. Further suppose that we can generate from each of
the full conditional distributions $Y_i \mid Y_{-i} = y_{-i}$, where

$$Y_{-i} = (Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_k),$$

namely, we can generate from the conditional distribution of $Y_i$ given the values of all
the other coordinates. The Gibbs sampler then proceeds iteratively as follows.

1. Specify an initial value $(y_{1(0)}, \ldots, y_{k(0)})$ for $(Y_1, \ldots, Y_k)$.

2. For $N > 0$, generate $Y_{i(N)}$ from its conditional distribution given
$(y_{1(N)}, \ldots, y_{i-1(N)}, y_{i+1(N-1)}, \ldots, y_{k(N-1)})$ for each $i = 1, \ldots, k$.

For example, if $k = 3$, we first specify $(y_{1(0)}, y_{2(0)}, y_{3(0)})$. Then we generate

$$\begin{aligned}
& Y_{1(1)} \mid Y_{2(0)} = y_{2(0)},\; Y_{3(0)} = y_{3(0)} \\
& Y_{2(1)} \mid Y_{1(1)} = y_{1(1)},\; Y_{3(0)} = y_{3(0)} \\
& Y_{3(1)} \mid Y_{1(1)} = y_{1(1)},\; Y_{2(1)} = y_{2(1)}
\end{aligned}$$

to obtain $(Y_{1(1)}, Y_{2(1)}, Y_{3(1)})$. Next we generate

$$\begin{aligned}
& Y_{1(2)} \mid Y_{2(1)} = y_{2(1)},\; Y_{3(1)} = y_{3(1)} \\
& Y_{2(2)} \mid Y_{1(2)} = y_{1(2)},\; Y_{3(1)} = y_{3(1)} \\
& Y_{3(2)} \mid Y_{1(2)} = y_{1(2)},\; Y_{2(2)} = y_{2(2)}
\end{aligned}$$

to obtain $(Y_{1(2)}, Y_{2(2)}, Y_{3(2)})$, etc. Note that we actually did not need to specify $Y_{1(0)}$,
as it is never used.
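
To make the iteration concrete, here is a minimal Python sketch of a Gibbs sampler for $k = 2$. The target, a standard bivariate normal with correlation `rho`, is an assumption chosen for illustration because its full conditionals are known exactly: $Y_1 \mid Y_2 = y_2 \sim N(\rho y_2, 1 - \rho^2)$, and symmetrically for $Y_2$. The function name `gibbs_bivariate_normal` is hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, y2_start=0.0, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho ** 2)
    # Only Y2 needs a starting value, since Y1 is generated first.
    y2 = y2_start
    chain = []
    for _ in range(n_iter):
        y1 = rng.gauss(rho * y2, cond_sd)  # Y1 | Y2 = y2 (previous iteration)
        y2 = rng.gauss(rho * y1, cond_sd)  # Y2 | Y1 = y1 (current iteration)
        chain.append((y1, y2))
    return chain


chain = gibbs_bivariate_normal(0.5, 20_000)
mean_y1 = sum(y1 for y1, _ in chain) / len(chain)
mean_prod = sum(y1 * y2 for y1, y2 in chain) / len(chain)
print(f"mean of Y1: {mean_y1:.3f}, mean of Y1*Y2: {mean_prod:.3f}")
```

Each update conditions on the most recently generated value of the other coordinate, as in step 2. For this target the long-run averages should approach 0 and `rho`, respectively.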

It can then be shown (see Section 11.3) that, in fairly general circumstances,
$(Y_{1(N)}, \ldots, Y_{k(N)})$ converges in distribution to the joint distribution of $(Y_1, \ldots, Y_k)$ as
$N \to \infty$. So for large $N$, we have that the distribution of $(Y_{1(N)}, \ldots, Y_{k(N)})$ is approximately
the same as the joint distribution of $(Y_1, \ldots, Y_k)$ from which we want to sample. So
Gibbs sampling provides an approximate method for sampling from a distribution of
interest.

Furthermore, and this is the result that is most relevant for simulations, it can be
shown that, under conditions,

$$\bar w = \frac{1}{N} \sum_{i=1}^{N} w(Y_{1(i)}, \ldots, Y_{k(i)}) \xrightarrow{a.s.} E(w(Y_1, \ldots, Y_k)).$$


Estimation of the variance of $\bar w$ is different than in the i.i.d. case, where we used the
sample variance, because now the $w(Y_{1(i)}, \ldots, Y_{k(i)})$ terms are not independent.

There are several approaches to estimating the variance of $\bar w$, but perhaps the most
commonly used is the technique of batching. For this we divide the sequence

$$w(Y_{1(1)}, \ldots, Y_{k(1)}), \ldots, w(Y_{1(N)}, \ldots, Y_{k(N)})$$

into $N/m$ nonoverlapping sequential batches of size $m$ (assuming here that $N$ is
divisible by $m$), calculate the mean in each batch obtaining $\bar w_1, \ldots, \bar w_{N/m}$, and then estimate
the variance of $\bar w$ by

$$\frac{s_b^2}{N/m}, \tag{7.3.4}$$

where $s_b^2$ is the sample variance obtained from the batch means, i.e.,

$$s_b^2 = \frac{1}{N/m - 1} \sum_{i=1}^{N/m} \left(\bar w_i - \bar w\right)^2.$$

It can be shown that $(Y_{1(i)}, \ldots, Y_{k(i)})$ and $(Y_{1(i+m)}, \ldots, Y_{k(i+m)})$ are approximately
independent for $m$ large enough. Accordingly, we choose the batch size $m$ large enough
so that the batch means are approximately independent, but not so large as to leave
very few degrees of freedom for the estimation of the variance. Under ideal conditions,
$\bar w_1, \ldots, \bar w_{N/m}$ is an i.i.d. sequence with sample mean

$$\bar w = \frac{1}{N/m} \sum_{i=1}^{N/m} \bar w_i,$$

and, as usual, we estimate the variance of $\bar w$ by (7.3.4).
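
The batching calculation in (7.3.4) can be sketched in a few lines of Python. The function name `batch_means_se` is hypothetical. Unlike the text, which assumes $N$ is divisible by $m$, this sketch drops any leftover values at the end of the sequence.

```python
import math


def batch_means_se(values, m):
    """Standard error of the overall mean via nonoverlapping batches of size m."""
    n_batches = len(values) // m
    if n_batches < 2:
        raise ValueError("need at least two batches to estimate the variance")
    used = values[: n_batches * m]  # drop leftover values (an assumption)
    batch_means = [sum(used[j * m:(j + 1) * m]) / m for j in range(n_batches)]
    overall = sum(batch_means) / n_batches
    # Sample variance s_b^2 of the batch means, divisor (N/m) - 1.
    s_b2 = sum((b - overall) ** 2 for b in batch_means) / (n_batches - 1)
    # Estimated variance of the overall mean is s_b^2 / (N/m), as in (7.3.4).
    return math.sqrt(s_b2 / n_batches)


print(batch_means_se([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2))
```

In the small example the batch means are 1.5, 3.5, and 5.5, so $s_b^2 = 4$ and the estimated variance of the mean is $4/3$. Comparing the result over several batch sizes is one way to judge whether the batches are long enough.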

Sometimes even Gibbs sampling cannot be directly implemented because we
cannot obtain algorithms to generate from all the full conditionals. There are a variety
of techniques for dealing with this, but in many statistical applications the technique
of latent variables often works. For this, we search for some random variables, say
$(V_1, \ldots, V_l)$, where each $Y_i$ is a function of $(V_1, \ldots, V_l)$ and such that we can apply
Gibbs sampling to the joint distribution of $(V_1, \ldots, V_l)$. We illustrate Gibbs sampling
via latent variables in the following example.


EXAMPLE 7.3.2 Location-Scale Student
Suppose now that $(x_1, \ldots, x_n)$ is a sample from a distribution that is of the form
$X = \mu + \sigma Z$, where $Z \sim t(\lambda)$ (see Section 4.6.2 and Problem 4.6.14). If $\lambda > 2$, then $\mu$ is
the mean and $\sigma(\lambda/(\lambda - 2))^{1/2}$ is the standard deviation of the distribution (see Problem
4.6.16). Note that $\lambda = \infty$ corresponds to normal variation, while $\lambda = 1$ corresponds to
Cauchy variation.

We will fix $\lambda$ at some specified value to reflect the fact that we are interested in
modeling situations in which the variable under consideration has a distribution with
longer tails than the normal distribution. Typically, this manifests itself in a histogram
of the data with a roughly symmetric shape but exhibiting a few extreme values out in
the tails, so a $t(\lambda)$ distribution might be appropriate.

Suppose we place the prior on $(\mu, \sigma^2)$ given by $\mu \mid \sigma^2 \sim N(\mu_0, \tau_0^2 \sigma^2)$ and
$1/\sigma^2 \sim \text{Gamma}(\alpha_0, \beta_0)$. The likelihood function is given by

$$\left(\frac{1}{\sigma^2}\right)^{n/2} \prod_{i=1}^{n} \left[1 + \frac{1}{\lambda}\left(\frac{x_i - \mu}{\sigma}\right)^2\right]^{-(\lambda+1)/2}, \tag{7.3.5}$$

hence the posterior density of $(\mu, 1/\sigma^2)$ is proportional to

$$\left(\frac{1}{\sigma^2}\right)^{n/2} \prod_{i=1}^{n} \left[1 + \frac{1}{\lambda}\left(\frac{x_i - \mu}{\sigma}\right)^2\right]^{-(\lambda+1)/2} \times \left(\frac{1}{\sigma^2}\right)^{1/2} \exp\left(-\frac{1}{2\tau_0^2\sigma^2}(\mu - \mu_0)^2\right) \left(\frac{1}{\sigma^2}\right)^{\alpha_0 - 1} \exp\left(-\frac{\beta_0}{\sigma^2}\right).$$

This distribution is not immediately recognizable, and it is not at all clear how to
generate from it.

It is natural, then, to see if we can implement Gibbs sampling. To do this directly,
we need an algorithm to generate from the posterior of $\mu$ given the value of $\sigma^2$, and
an algorithm to generate from the posterior of $\sigma^2$ given $\mu$. Unfortunately, neither of
these conditional distributions is amenable to the techniques discussed in Section 2.10,
so we cannot implement Gibbs sampling directly.

Recall, however, that when $V \sim \chi^2(\lambda) = \text{Gamma}(\lambda/2, 1/2)$ (see Problem 4.6.13)
independent of $Y \sim N(0, 1)$, then (Problem 4.6.14)

$$Z = \frac{Y}{\sqrt{V/\lambda}} \sim t(\lambda).$$

Therefore, writing

$$X = \mu + \sigma Z = \mu + \sigma \frac{Y}{\sqrt{V/\lambda}},$$

we have that $X \mid V = v \sim N(\mu, \sigma^2\lambda/v)$.

We now introduce the $n$ latent or hidden variables $(V_1, \ldots, V_n)$, which are i.i.d.
$\chi^2(\lambda)$, and suppose $X_i \mid V_i = v_i \sim N(\mu, \sigma^2\lambda/v_i)$. The $V_i$ are considered latent
because they are not really part of the problem formulation but have been added here for
convenience (as we shall see). Then, noting that there is a factor $v_i^{1/2}$ associated with
the density of $X_i \mid V_i = v_i$, the joint density of the values $(X_1, V_1), \ldots, (X_n, V_n)$ is
proportional to

$$\left(\frac{1}{\sigma^2}\right)^{n/2} \prod_{i=1}^{n} \exp\left(-\frac{v_i}{2\sigma^2\lambda}(x_i - \mu)^2\right) v_i^{(\lambda/2) - (1/2)} \exp\left(-\frac{v_i}{2}\right).$$

From the above argument, the marginal joint density of $(X_1, \ldots, X_n)$ (after integrating
out the $v_i$'s) is proportional to (7.3.5), namely, a sample of $n$ from the distribution
specified by $X = \mu + \sigma Z$, where $Z \sim t(\lambda)$. With the same prior structure as before,
we have that the joint density of

$$(X_1, V_1), \ldots, (X_n, V_n), \mu, 1/\sigma^2$$

is proportional to

$$\left(\frac{1}{\sigma^2}\right)^{n/2} \prod_{i=1}^{n} \exp\left(-\frac{v_i}{2\sigma^2\lambda}(x_i - \mu)^2\right) v_i^{(\lambda/2) - (1/2)} \exp\left(-\frac{v_i}{2}\right) \times \left(\frac{1}{\sigma^2}\right)^{1/2} \exp\left(-\frac{1}{2\tau_0^2\sigma^2}(\mu - \mu_0)^2\right) \left(\frac{1}{\sigma^2}\right)^{\alpha_0 - 1} \exp\left(-\frac{\beta_0}{\sigma^2}\right). \tag{7.3.6}$$


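In (7.3.6), treat $x_1, \ldots, x_n$ as constants (we observed these values) and consider
the conditional distributions of each of the variables $V_1, \ldots, V_n, \mu, 1/\sigma^2$ given all the
other variables. From (7.3.6), we have that the full conditional density of $\mu$ is
proportional to

$$\exp\left\{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^{n} \frac{v_i}{\lambda}(x_i - \mu)^2 + \frac{1}{\tau_0^2}(\mu - \mu_0)^2\right]\right\},$$

which is proportional to

$$\exp\left\{-\frac{1}{2\sigma^2}\left[\left(\sum_{i=1}^{n} \frac{v_i}{\lambda} + \frac{1}{\tau_0^2}\right)\mu^2 - 2\left(\sum_{i=1}^{n} \frac{v_i}{\lambda} x_i + \frac{\mu_0}{\tau_0^2}\right)\mu\right]\right\}.$$

From this, we immediately deduce that

$$\mu \mid x_1, \ldots, x_n, v_1, \ldots, v_n, \sigma^2 \sim N\left(r(v_1, \ldots, v_n)\left(\sum_{i=1}^{n} \frac{v_i}{\lambda} x_i + \frac{\mu_0}{\tau_0^2}\right),\; r(v_1, \ldots, v_n)\,\sigma^2\right),$$

where

$$r(v_1, \ldots, v_n) = \left(\sum_{i=1}^{n} \frac{v_i}{\lambda} + \frac{1}{\tau_0^2}\right)^{-1}.$$

From (7.3.6), we have that the conditional density of $1/\sigma^2$ is proportional to

$$\left(\frac{1}{\sigma^2}\right)^{(n/2) + \alpha_0 - (1/2)} \exp\left\{-\left[\sum_{i=1}^{n} \frac{v_i}{\lambda}(x_i - \mu)^2 + \frac{1}{\tau_0^2}(\mu - \mu_0)^2 + 2\beta_0\right]\frac{1}{2\sigma^2}\right\},$$

and we immediately deduce that

$$\frac{1}{\sigma^2} \,\Big|\, x_1, \ldots, x_n, v_1, \ldots, v_n, \mu \sim \text{Gamma}\left(\frac{n}{2} + \alpha_0 + \frac{1}{2},\; \frac{1}{2}\left(\sum_{i=1}^{n} \frac{v_i}{\lambda}(x_i - \mu)^2 + \frac{1}{\tau_0^2}(\mu - \mu_0)^2 + 2\beta_0\right)\right).$$

Finally, the conditional density of $V_i$ is proportional to

$$v_i^{(\lambda/2) - (1/2)} \exp\left\{-\left[\frac{(x_i - \mu)^2}{\sigma^2\lambda} + 1\right]\frac{v_i}{2}\right\},$$

and it is immediate that

$$V_i \mid x_1, \ldots, x_n, v_1, \ldots, v_{i-1}, v_{i+1}, \ldots, v_n, \mu, \sigma^2 \sim \text{Gamma}\left(\frac{\lambda}{2} + \frac{1}{2},\; \frac{1}{2}\left(\frac{(x_i - \mu)^2}{\sigma^2\lambda} + 1\right)\right).$$

We can now easily generate from all these distributions and implement a Gibbs
sampling algorithm. As we are not interested in the values of $V_1, \ldots, V_n$, we simply
discard these as we iterate.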
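
The following Python sketch, which is not the Appendix B code, shows one way the three full conditionals derived above could be cycled through. It assumes the data and prior of Example 7.3.1, $\lambda = 3$, starting values $\mu = 5.2$ and $\sigma = 3.3$, and that the second Gamma argument is a rate (so its reciprocal is passed as the scale to `random.gammavariate`). The function name `gibbs_location_scale_student` is hypothetical.

```python
import math
import random

X = [11.6714, 1.8957, 2.1228, 2.1286, 1.0751,
     8.1631, 1.8236, 4.0362, 6.8513, 7.6461,
     1.9020, 7.4899, 4.9233, 8.3223, 7.9486]
MU0, TAU0_SQ, ALPHA0, BETA0 = 4.0, 2.0, 2.0, 1.0  # prior of Example 7.3.1


def gibbs_location_scale_student(x, n_iter, lam=3.0, seed=0):
    """Return psi = sigma/mu for each Gibbs iteration (latent v_i are discarded)."""
    rng = random.Random(seed)
    n = len(x)
    mu, sigma2 = 5.2, 3.3 ** 2  # assumed starting values
    psi = []
    for _ in range(n_iter):
        # V_i | rest ~ Gamma(lam/2 + 1/2, rate ((x_i - mu)^2 / (sigma2 * lam) + 1) / 2)
        v = [rng.gammavariate(0.5 * lam + 0.5,
                              2.0 / ((xi - mu) ** 2 / (sigma2 * lam) + 1.0))
             for xi in x]
        # mu | rest ~ N(r * (sum v_i x_i / lam + mu0 / tau0^2), r * sigma2)
        r = 1.0 / (sum(v) / lam + 1.0 / TAU0_SQ)
        center = r * (sum(vi * xi for vi, xi in zip(v, x)) / lam + MU0 / TAU0_SQ)
        mu = rng.gauss(center, math.sqrt(r * sigma2))
        # 1/sigma2 | rest ~ Gamma(n/2 + alpha0 + 1/2, rate below)
        rate = 0.5 * (sum(vi * (xi - mu) ** 2 for vi, xi in zip(v, x)) / lam
                      + (mu - MU0) ** 2 / TAU0_SQ + 2.0 * BETA0)
        sigma2 = 1.0 / rng.gammavariate(0.5 * n + ALPHA0 + 0.5, 1.0 / rate)
        psi.append(math.sqrt(sigma2) / mu)
    return psi


psi_chain = gibbs_location_scale_student(X, 10_000)
print(sum(1 for p in psi_chain if p <= 0.5) / len(psi_chain))
```

The printed proportion is a Monte Carlo estimate of the posterior probability that $\psi \le 0.5$. Because successive values are dependent, its standard error should be assessed by batching rather than by the i.i.d. formula.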

Let us now consider a specific computation using the same data and prior as in
Example 7.3.1. The analysis of Example 7.3.1 assumed that the data were coming from
a normal distribution, but now we are going to assume that the data are a sample from
a $\mu + \sigma t(3)$ distribution, i.e., $\lambda = 3$. We again consider approximating the posterior
distribution of the coefficient of variation $\psi = \sigma/\mu$.

We carry out the Gibbs sampling iteration in the order $v_1, \ldots, v_n, \mu, 1/\sigma^2$. This
implies that we need starting values only for $\mu$ and $\sigma^2$ (the full conditionals of the $v_i$
do not depend on the other $v_j$). We take the starting value of $\mu$ to be $\bar x = 5.2$ and the
starting value of $\sigma$ to be $s = 3.3$. For each generated value of $(\mu, \sigma^2)$, we calculate $\psi$
to obtain the sequence $\psi_1, \psi_2, \ldots, \psi_N$.

The values $\psi_1, \psi_2, \ldots, \psi_N$ are not i.i.d. from the posterior of $\psi$. The best we can
say is that

$$\psi_m \xrightarrow{D} \psi \sim \omega(\cdot \mid x_1, \ldots, x_n)$$

as $m \to \infty$, where $\omega(\cdot \mid x_1, \ldots, x_n)$ is the posterior density of $\psi$. Also, values
sufficiently far apart in the sequence will be like i.i.d. values from $\omega(\cdot \mid x_1, \ldots, x_n)$. Thus,
one approach is to determine an appropriate value $m$ and then extract $\psi_m, \psi_{2m}, \psi_{3m}, \ldots$
as an approximate i.i.d. sequence from the posterior. Often it is difficult to determine
an appropriate value for $m$, however.

In any case, it is known that, under fairly weak conditions,

$$\bar w = \frac{1}{N} \sum_{i=1}^{N} w(\psi_i) \xrightarrow{a.s.} E(w(\psi) \mid x_1, \ldots, x_n)$$

as $N \to \infty$. So we can use the whole sequence $\psi_1, \psi_2, \ldots, \psi_N$ and record a density
histogram for $\psi$, just as we did in Example 7.3.1. The value of the density histogram
between two cut points will converge almost surely to the correct value as $N \to \infty$.
However, we will have to take $N$ larger when using the Gibbs sampling algorithm than
with i.i.d. sampling, to achieve the same accuracy. For many examples, the effect of the
deviation of the sequence from being i.i.d. is very small, so $N$ will not have to be much
larger. We always need to be cautious, however, and the general recommendation is to
compute estimates for successively higher values of $N$, only stopping when the results
seem to have stabilized.

In Figure 7.3.5, we have plotted the density histogram of the $\psi$ values that resulted
from $10^4$ iterations of the Gibbs sampler. In this case, plotting the density histogram of
$\psi$ based upon $N = 5 \times 10^4$ and $N = 8 \times 10^4$ resulted in only minor deviations from
this plot. Note that this density looks very similar to that plotted in Example 7.3.1, but
it is not quite so peaked and it has a shorter right tail.

Figure 7.3.5: A density histogram of $N = 10^4$ values of $\psi$ generated sequentially via Gibbs
sampling in Example 7.3.2.


We can also estimate II(w < 0.5|x1,...,2Xn), just as we did in Example 7.3.1, 
by recording the proportion of values in the sequence that are smaller than 0.5, i.e., 
w(yw) = I4(w), where A = {0 : yw < 0.5}. In this case, we obtained the estimate 
0.5441, which is quite different from the value obtained in Example 7.3.1. So using a 
t (3) distribution to describe the variation in the response has made a big difference in 
the results. 

Of course, we must also quantify how accurate we believe our estimate is. Using 
a batch size of m = 10, we obtained the standard error of the estimate 0.5441 to be 
0.00639. When we took the batch size to be m = 20, the standard error of the mean 
is 0.00659; with a batch size of m = 40, the standard error of the mean is 0.00668. 
So we feel quite confident that we are assessing the error in the estimate appropriately. 
Again, under conditions, we have that w is asymptotically normal so that in this case 
we can assert that the interval 0.5441 +3(0.0066) = [0.5243, 0.5639] contains the true 
value of II(y < 0.5|x1,...,X,) with virtual certainty. 

See Appendix B for some code that was used to implement the Gibbs sampling 
algorithm described here. ■


It is fair to say that the introduction of Gibbs sampling has resulted in a revolution
in statistical applications due to the wide variety of previously intractable problems
that it successfully handles. There are a number of modifications and closely related
algorithms. We refer the interested reader to Chapter 11, where the general theory of
what is called Markov chain Monte Carlo (MCMC) is discussed.


Summary of Section 7.3 


• Implementation of Bayesian inference often requires the evaluation of complicated
integrals or sums.

• If, however, we can sample from the posterior of the parameter, this will often
lead to sufficiently accurate approximations to these integrals or sums via Monte
Carlo.

• It is often difficult to sample exactly from a posterior distribution of interest.
In such circumstances, Gibbs sampling can prove to be an effective method for
generating an approximate sample from this distribution.


EXERCISES 


7.3.1 Suppose we have the following sample from an N(μ, 2) distribution, where μ is
unknown.

2.6 4.2 3.1 5.2 3.7 3.8 5.6 1.8 5.3 4.0
3.0 4.0 4.1 3.2 2.2 3.4 4.5 2.9 4.7 5.2

If the prior on μ is Uniform(2, 6), determine an approximate 0.95-credible interval for
μ based on the large sample results described in Section 7.3.1.

7.3.2 Determine the form of the approximate 0.95-credible interval of Section 7.3.1,
for the Bernoulli model with a Beta(α, β) prior, discussed in Example 7.2.2.

7.3.3 Determine the form of the approximate 0.95-credible intervals of Section 7.3.1,
for the location-normal model with an N(μ0, τ0²) prior, discussed in Example 7.2.3.

7.3.4 Suppose that X ~ Uniform[0, 1/θ] and θ ~ Exponential(1). Derive a crude
Monte Carlo algorithm, based on generating from a gamma distribution, to generate a
value from the conditional distribution θ | X = x. Generalize this to a sample of n from
the Uniform[0, 1/θ] distribution. When will this algorithm be inefficient in the sense
that we need a lot of computation to generate a single value?

7.3.5 Suppose that X ~ N(θ, 1) and θ ~ Uniform[0, 1]. Derive a crude Monte Carlo
algorithm, based on generating from a normal distribution, to generate from the con-
ditional distribution θ | X = x. Generalize this to a sample of n from the N(θ, 1)
distribution. When will this algorithm be inefficient in the sense that we need a lot of
computation to generate a single value?

7.3.6 Suppose that X ~ 0.5N(0, 1) + 0.5N(θ, 2) and θ ~ Uniform[0, 1]. Derive a
crude Monte Carlo algorithm, based on generating from a mixture of normal
distributions, to generate from the conditional distribution θ | X = x. Generalize this to a
sample of n = 2 from the 0.5N(0, 1) + 0.5N(θ, 2) distribution.


COMPUTER EXERCISES 


7.3.7 In the context of Example 7.3.1, construct a density histogram of the posterior
distribution of ψ = μ + σ z_{0.25}, i.e., the population first quartile, using N = 5 × 10^3
and N = 10^4, and compare the results. Estimate the posterior mean of this distribution
and assess the error in your approximation. (Hint: Modify the program in Appendix
B.)

7.3.8 Suppose that a manufacturer takes a random sample of manufactured items and
tests each item as to whether it is defective or not. The responses are felt to be i.i.d.
Bernoulli(θ), where θ is the probability that the item is defective. The manufacturer
places a Beta(0.5, 10) distribution on θ. If a sample of n = 100 items is taken and 5
defectives are observed, then, using a Monte Carlo sample with N = 1000, estimate
the posterior probability that θ < 0.1 and assess the error in your estimate.

7.3.9 Suppose that lifelengths (in years) of a manufactured item are known to follow
an Exponential(λ) distribution, where λ > 0 is unknown and for the prior we take λ ~
Gamma(10, 2). Suppose that the lifelengths 4.3, 6.2, 8.4, 3.1, 6.0, 5.5, and 7.8 were
observed.

(a) Using a Monte Carlo sample of size N = 10^3, approximate the posterior probability
that λ ∈ [3, 6] and assess the error of your estimate.

(b) Using a Monte Carlo sample of size N = 10^3, approximate the posterior probability
function of ⌊1/λ⌋ (⌊x⌋ equals the greatest integer less than or equal to x).

(c) Using a Monte Carlo sample of size N = 10^3, approximate the posterior expecta-
tion of ⌊1/λ⌋ and assess the error in your approximation.

7.3.10 Generate a sample of n = 10 from a Pareto(2) distribution. Now pretend you
only know that you have a sample from a Pareto(α) distribution, where α > 0 is
unknown, and place a Gamma(2, 1) prior on α. Using a Monte Carlo sample of size
N = 10^4, approximate the posterior expectation of 1/(α + 1) based on the observed
sample, and assess the accuracy of your approximation by quoting an interval that
contains the exact value with virtual certainty. (Hint: Problem 2.10.15.)


PROBLEMS 


7.3.11 Suppose X1, ..., Xn is a sample from the model {f_θ : θ ∈ Ω} and all the reg-
ularity conditions of Section 6.5 apply. Assume that the prior π(θ) is a continuous
function of θ and that the posterior mode θ̂(X1, ..., Xn) → θ almost surely when
X1, ..., Xn is a sample from f_θ (the latter assumption holds under very general conditions).

(a) Using the fact that, if Y_n → Y almost surely and g is a continuous function, then
g(Y_n) → g(Y) almost surely, prove that

\[
-\frac{1}{n}\,\frac{\partial^2 \ln\left(L(\theta \mid x_1,\ldots,x_n)\,\pi(\theta)\right)}{\partial\theta^2}\bigg|_{\theta=\hat{\theta}} \stackrel{a.s.}{\rightarrow} I(\theta)
\]

when X1, ..., Xn is a sample from f_θ.

(b) Explain to what extent the large sample approximate methods of Section 7.3.1
depend on the prior if the assumptions just described apply.


7.3.12 In Exercise 7.3.10, explain why the interval you constructed to contain the pos-
terior mean of 1/(α + 1) with virtual certainty may or may not contain the true value
of 1/(α + 1).

7.3.13 Suppose that (X, Y) is distributed Bivariate Normal(μ1, μ2, σ1, σ2, ρ). Deter-
mine a Gibbs sampling algorithm to generate from this distribution. Assume that you
have an algorithm for generating from univariate normal distributions. Is this the best
way to sample from this distribution? (Hint: Problem 2.8.27.)

7.3.14 Suppose that the joint density of (X, Y) is given by f_{X,Y}(x, y) = 8xy for
0 < x < y < 1. Fully describe a Gibbs sampling algorithm for this distribution. In
particular, indicate how you would generate all random variables. Can you design an
algorithm to generate exactly from this distribution?

7.3.15 In Example 7.3.1, prove that the posterior mean of ψ = σ/μ does not exist.
(Hint: Use Problem 7.2.24 and the theorem of total expectation to split the integral into
two parts, where one part has value ∞ and the other part has value −∞.)

7.3.16 (Importance sampling based on the prior) Suppose we have an algorithm to 
generate from the prior. 

(a) Indicate how you could use this to approximate a posterior expectation using im- 
portance sampling (see Problem 4.5.21). 

(b) What do you suppose is the major weakness of this approach?


COMPUTER PROBLEMS 


7.3.17 In the context of Example 7.3.2, construct a density histogram of the posterior
distribution of ψ = μ + σ z_{0.25}, i.e., the population first quartile, using N = 10^4. Esti-
mate the posterior mean of this distribution and assess the error in your approximation.


7.4 Choosing Priors


The issue of selecting a prior for a problem is an important one. Of course, the idea is 
that we choose a prior to reflect our a priori beliefs about the true value of θ. Because
this will typically vary from statistician to statistician, this is often criticized as being
too subjective for scientific studies. It should be remembered, however, that the sam-
pling model {f_θ : θ ∈ Ω} is also a subjective choice by the statistician. These choices
are guided by the statistician’s judgment. What then justifies one choice of a statistical 
model or prior over another? 

In effect, when statisticians choose a prior and a model, they are prescribing a joint 
distribution for (θ, s). The only way to assess whether or not an appropriate choice
was made is to check whether the observed s is reasonable given this choice. If s is 
surprising, when compared to the distribution prescribed by the model and prior, then 
we have evidence against the statistician’s choices. Methods designed to assess this 
are called model-checking procedures, and are discussed in Chapter 9. At this point, 
however, we should recognize the subjectivity that enters into statistical analyses, but 
take some comfort that we have a methodology for checking whether or not the choices 
made by the statistician make sense. 


Often a statistician will consider a particular family {π_λ : λ ∈ Λ} of priors for a
problem and try to select a suitable prior π_{λ0} ∈ {π_λ : λ ∈ Λ}. In such a context the
parameter λ is called a hyperparameter. Note that this family could be the set of all
possible priors, so there is no restriction in this formulation. We now discuss some
commonly used families {π_λ : λ ∈ Λ} and methods for selecting λ0 ∈ Λ.


7.4.1 Conjugate Priors

Depending on the sampling model, the family may be conjugate.


Definition 7.4.1 The family of priors {π_λ : λ ∈ Λ} for the parameter θ of the model
{f_θ : θ ∈ Ω} is conjugate if, for all data s ∈ S and all λ ∈ Λ, the posterior π_λ(· | s) ∈
{π_λ : λ ∈ Λ}.


Conjugacy is usually a great convenience as we start with some choice λ0 ∈ Λ for
the prior, and then we find the relevant λ_s ∈ Λ for the posterior, often without much
computation. While conjugacy can be criticized as a mere mathematical convenience, 
it has to be acknowledged that many conjugate families offer sufficient variety to allow 
for the expression of a wide spectrum of prior beliefs. 


EXAMPLE 7.4.1 Conjugate Families 

In Example 7.1.1, we have effectively shown that the family of all Beta distributions is 
conjugate for sampling from the Bernoulli model. In Example 7.1.2, it is shown that 
the family of normal priors is conjugate for sampling from the location normal model. 
In Example 7.1.3, it is shown that the family of Dirichlet distributions is conjugate for 
Multinomial models. In Example 7.1.4, it is shown that the family of priors specified 
there is conjugate for sampling from the location-scale normal model. ■
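To make conjugacy concrete, the Beta family for the Bernoulli model can be sketched in a few lines of Python: a Beta(α, β) prior combined with 0/1 data gives a Beta(α + number of 1's, β + number of 0's) posterior, so the update only moves the hyperparameter. The function name and the numbers below are hypothetical, chosen only for illustration.

```python
def beta_bernoulli_update(alpha, beta, data):
    """Return the Beta posterior parameters after observing 0/1 data."""
    if any(x not in (0, 1) for x in data):
        raise ValueError("data must consist of 0s and 1s")
    successes = sum(data)
    failures = len(data) - successes
    return alpha + successes, beta + failures


# Hypothetical prior and data, chosen only for illustration.
prior = (2.0, 3.0)
data = [1, 0, 0, 1, 1, 0, 1]
posterior = beta_bernoulli_update(*prior, data)
print(posterior)  # (6.0, 6.0)
```

Because the posterior stays in the same family, processing the data in two pieces gives the same result as processing it all at once.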


Of course, using a conjugate family does not tell us how to select λ0. Perhaps the
most justifiable approach is to use prior elicitation. 


7.4.2 Elicitation


Elicitation involves explicitly using the statistician’s beliefs about the true value of θ
to select a prior in {π_λ : λ ∈ Λ} that reflects these beliefs. Typically, this involves the
statistician asking questions of himself or herself, or of experts in the application area,
in such a way that the answers specify a prior from the family.


EXAMPLE 7.4.2 Location Normal 

Suppose we are sampling from an N(μ, σ0²) distribution with μ unknown and σ0²
known, and we restrict attention to the family {N(μ0, τ0²) : μ0 ∈ R¹, τ0 > 0} of
priors for μ. So here, λ = (μ0, τ0) and there are two degrees of freedom in this family.
Thus, specifying two independent characteristics specifies a prior.

Accordingly, we could ask an expert to specify two quantiles of his or her prior
distribution for μ (see Exercise 7.4.10), as this specifies a prior in the family. For
example, we might ask an expert to specify a number μ0 such that the true value of μ
was as likely to be greater than as less than μ0, so that μ0 is the median of the prior.
We might also ask the expert to specify a value ν0 such that there is 99% certainty that
the true value of μ is less than ν0. This of course is the 0.99-quantile of their prior.

Alternatively, we could ask the expert to specify the center μ0 of their prior dis-
tribution and a constant τ0 such that μ0 ± 3τ0 contains the true value of μ with
virtual certainty. Clearly, in this case, μ0 is the prior mean and τ0 is the prior standard
deviation. ■
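The two elicitation routes in this example translate directly into arithmetic, sketched below in Python. With a median μ0 and a 0.99-quantile ν0, the normal prior has τ0 = (ν0 − μ0)/z_{0.99}, where z_{0.99} is the standard normal 0.99-quantile; with a center and a "virtual certainty" half-width, τ0 is one third of the half-width. The function names and the expert's answers are hypothetical.

```python
from statistics import NormalDist


def prior_from_median_and_upper_quantile(mu0, v0, p=0.99):
    """N(mu0, tau0^2) prior with median mu0 and p-quantile v0; returns (mu0, tau0)."""
    if not 0.5 < p < 1.0 or v0 <= mu0:
        raise ValueError("need 0.5 < p < 1 and v0 > mu0")
    z_p = NormalDist().inv_cdf(p)  # standard normal p-quantile
    return mu0, (v0 - mu0) / z_p


def prior_from_center_and_range(mu0, half_width):
    """N(mu0, tau0^2) prior for which mu0 +/- 3*tau0 spans mu0 +/- half_width."""
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    return mu0, half_width / 3.0


# Hypothetical expert answers, for illustration only.
mu0, tau0 = prior_from_median_and_upper_quantile(mu0=10.0, v0=15.0)
print(mu0, round(tau0, 4))
print(prior_from_center_and_range(10.0, 6.0))  # (10.0, 2.0)
```

A quick sanity check is to confirm that the fitted prior reproduces the elicited quantile, e.g., that NormalDist(mu0, tau0).cdf(15.0) is 0.99.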


Elicitation is an important part of any Bayesian statistical analysis. If the experts 
used are truly knowledgeable about the application, then it seems intuitively clear that 
we will improve a statistical analysis by including such prior information. 

The process of elicitation can be somewhat involved, however, for complicated 
problems. Furthermore, there are various considerations that need to be taken into
account, involving prejudices and flaws in the way we reason about probability outside of
a mathematical formulation. See Garthwaite, Kadane and O’Hagan (2005), “Statisti- 
cal methods for eliciting probability distributions”, Journal of the American Statistical 
Association (Vol. 100, No. 470, pp. 680-700), for a deeper discussion of these issues. 


7.4.3 Empirical Bayes


When the choice of λ0 is based on the data s, these methods are referred to as empirical
Bayesian methods. Logically, such methods would seem to violate a basic principle
of inference, namely, the principle of conditional probability. For when we compute
the posterior distribution of θ using a prior based on s, in general this is no longer
the conditional distribution of θ given the data. While this is certainly an important
concern, in many problems the application of empirical Bayes leads to inferences with 
satisfying properties. 

For example, one empirical Bayesian method is to compute the prior predictive
m_λ(s) for the data s, and then base the choice of λ on these values. Note that the
prior predictive is like a likelihood function for λ (as it is the density or probability
function for the observed s), and so the methods of Chapter 6 apply for inference about
λ. For example, we could select the value of λ that maximizes m_λ(s). The required
computations can be extensive, as λ is typically multidimensional. We illustrate with a
simple example.


EXAMPLE 7.4.3 Bernoulli 

Suppose we have a sample x1, ..., xn from a Bernoulli(θ) distribution and we contem-
plate putting a Beta(λ, λ) prior on θ for some λ > 0. So the prior is symmetric about
1/2 and the spread in this distribution is controlled by λ. Since the prior mean is 1/2
and the prior variance is λ²/[(2λ + 1)(2λ)²] = 1/(4(2λ + 1)) → 0 as λ → ∞, we see
that choosing λ large leads to a very precise prior. Then we have that

\[
m_\lambda(x_1,\ldots,x_n) = \int_0^1 \frac{\Gamma(2\lambda)}{\Gamma^2(\lambda)}\,\theta^{n\bar{x}+\lambda-1}(1-\theta)^{n(1-\bar{x})+\lambda-1}\,d\theta
= \frac{\Gamma(2\lambda)}{\Gamma^2(\lambda)}\,\frac{\Gamma(n\bar{x}+\lambda)\,\Gamma(n(1-\bar{x})+\lambda)}{\Gamma(n+2\lambda)}.
\]


It is difficult to find the value of λ that maximizes this, but for real data we can tabulate
and plot m_λ(x1, ..., xn) to obtain this value. More advanced computational methods
can also be used.

For example, suppose that n = 20 and we obtained nx̄ = 5 as the number of 1’s
observed. In Figure 7.4.1 we have plotted the graph of m_λ(x1, ..., xn) as a function of
λ. We can see from this that the maximum occurs near λ = 2. More precisely, from a
tabulation we determine that λ = 2.3 is close to the maximum. Accordingly, we use
the Beta(5 + 2.3, 15 + 2.3) = Beta(7.3, 17.3) distribution for inferences about θ.
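The tabulation mentioned above can be sketched in Python using the closed form Γ(2λ)Γ(nx̄ + λ)Γ(n(1 − x̄) + λ)/(Γ²(λ)Γ(n + 2λ)) for the prior predictive under a Beta(λ, λ) prior. The grid spacing of 0.1 and the function name are assumptions made for this sketch; working on the log scale with lgamma avoids overflow in the gamma functions.

```python
from math import exp, lgamma


def log_prior_predictive(lam, n, successes):
    """Log of the prior predictive for Bernoulli data under a Beta(lam, lam) prior."""
    return (lgamma(2 * lam) - 2 * lgamma(lam)
            + lgamma(successes + lam) + lgamma(n - successes + lam)
            - lgamma(n + 2 * lam))


n, successes = 20, 5                          # the data summary used in the example
grid = [0.1 * k for k in range(1, 201)]       # hypothetical grid: 0.1, 0.2, ..., 20.0
best = max(grid, key=lambda lam: log_prior_predictive(lam, n, successes))
print(f"grid maximizer: {best:.1f}")
print(f"prior predictive there: {exp(log_prior_predictive(best, n, successes)):.3e}")
```

A finer grid or a one-dimensional optimizer could replace the grid search; the tabulation is used here only because it mirrors the approach described in the text.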


Figure 7.4.1: Plot of m_λ(x1, ..., xn) in Example 7.4.3 (vertical axis: prior predictive; horizontal axis: lambda).


There are many issues concerning empirical Bayes methods. This represents an 
active area of statistical research. 


7.4.4 Hierarchical Bayes


An alternative to choosing a prior for θ in {π_λ : λ ∈ Λ} consists of putting yet another
prior distribution ω, called a hyperprior, on λ. This approach is commonly called
hierarchical Bayes. The prior for θ basically becomes π(θ) = ∫_Λ π_λ(θ) ω(λ) dλ, so we
have in effect integrated out the hyperparameter. The problem then is how to choose
the prior ω. In essence, we have simply replaced the problem of choosing the prior
on θ with choosing the hyperprior on λ. It is common, in applications using hierarchical
Bayes, that default choices are made for ω, although we could also make use of
elicitation techniques. We will discuss this further in Section 7.4.5.
So in this situation, the posterior density of θ is equal to

\[
\pi(\theta \mid s) = \frac{f_\theta(s)\int_\Lambda \pi_\lambda(\theta)\,\omega(\lambda)\,d\lambda}{m(s)}
= \int_\Lambda \frac{f_\theta(s)\,\pi_\lambda(\theta)}{m_\lambda(s)}\,\frac{m_\lambda(s)\,\omega(\lambda)}{m(s)}\,d\lambda,
\]

where m(s) = ∫_Λ ∫_Ω f_θ(s) π_λ(θ) ω(λ) dθ dλ = ∫_Λ m_λ(s) ω(λ) dλ and, for fixed λ,
m_λ(s) = ∫_Ω f_θ(s) π_λ(θ) dθ (assuming λ is continuous with prior density given by ω).
Note that the posterior density of λ is m_λ(s) ω(λ)/m(s) while f_θ(s) π_λ(θ)/m_λ(s) is the
posterior density of θ given λ.

Therefore, we can use π(θ | s) for inferences about the model parameter θ (e.g.,
estimation, credible regions, and hypothesis assessment) and m_λ(s) ω(λ)/m(s) for in-
ferences about λ. Typically, however, we are not interested in λ and in fact it doesn’t
really make sense to talk about the “true” value of λ. The true value of θ corresponds
to the distribution that actually produced the observed data s, at least when the model
is correct, while we are not thinking of λ as being generated from ω. This also implies
another distinction between θ and λ. For θ is part of the likelihood function based on
how the data was generated, while λ is not.
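As a rough numerical illustration of the mixture structure just described, the sketch below uses the Bernoulli model with a Beta(λ, λ) prior and replaces the continuous hyperprior by a discrete one on a grid. This discretization, the uniform weights, the grid itself, and the function names are all assumptions made for the sketch. The posterior weight of each λ is proportional to the prior predictive at λ times the hyperprior weight, and the posterior mean of θ is the weighted average of the conditional posterior means (successes + λ)/(n + 2λ).

```python
from math import exp, lgamma, log


def log_prior_predictive(lam, n, successes):
    """Log prior predictive for Bernoulli data under a Beta(lam, lam) prior."""
    return (lgamma(2 * lam) - 2 * lgamma(lam)
            + lgamma(successes + lam) + lgamma(n - successes + lam)
            - lgamma(n + 2 * lam))


def hierarchical_posterior(n, successes, lam_grid, hyper_weights):
    """Posterior weights of lambda on a grid and the posterior mean of theta."""
    log_w = [log_prior_predictive(lam, n, successes) + log(w)
             for lam, w in zip(lam_grid, hyper_weights)]
    top = max(log_w)                       # subtract the maximum for numerical stability
    unnorm = [exp(v - top) for v in log_w]
    total = sum(unnorm)
    post_lam = [u / total for u in unnorm]
    # Given lambda, theta | data ~ Beta(successes + lam, n - successes + lam).
    post_mean = sum(p * (successes + lam) / (n + 2 * lam)
                    for p, lam in zip(post_lam, lam_grid))
    return post_lam, post_mean


lam_grid = [0.5 * k for k in range(1, 41)]              # hypothetical grid 0.5, ..., 20
hyper_weights = [1.0 / len(lam_grid)] * len(lam_grid)   # hypothetical uniform hyperprior
post_lam, post_mean = hierarchical_posterior(20, 5, lam_grid, hyper_weights)
print(round(post_mean, 4))
```

Unlike the empirical Bayes choice of a single λ, this averages over all grid values, weighting each by how well it predicts the observed data.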


EXAMPLE 7.4.4 Location-Scale Normal 

Suppose the situation is as is discussed in Example 7.1.4. In that case, both μ and
σ² are part of the likelihood function and so are model parameters, while μ0, τ0², α0,
and β0 are not, and so they are hyperparameters. To complete this specification as a
hierarchical model, we need to specify a prior ω(μ0, τ0², α0, β0), a task we leave to a
higher-level course. ■


7.4.5 Improper Priors and Noninformativity


One approach to choosing a prior, and to stop the chain of priors in a hierarchical Bayes
approach, is to prescribe a noninformative prior based on ignorance. Such a prior is
also referred to as a default prior or reference prior. The motivation is to specify a
prior that puts as little information into the analysis as possible and in some sense
characterizes ignorance. Surprisingly, in many contexts, statisticians have been led to
choose noninformative priors that are improper, i.e., ∫_Ω π(θ) dθ = ∞, so they do not
correspond to probability distributions.

The idea here is to give a rule such that, if a statistician has no prior beliefs about 
the value of a parameter or hyperparameter, then a prior is prescribed that reflects this. 
In the hierarchical Bayes approach, one continues up the chain until the statistician 
declares ignorance, and a default prior completes the specification. 

Unfortunately, just how ignorance is to be expressed turns out to be a rather subtle 
issue. In many cases, the default priors turn out to be improper, i.e., the integral or 
sum of the prior over the whole parameter space equals ∞, e.g., ∫_Ω π(θ) dθ = ∞, so
the prior is not a probability distribution. The interpretation of an improper prior is not
at all clear, and their use is somewhat controversial. Of course, (s, θ) no longer has a
joint probability distribution when we are using improper priors, and we cannot use the 
principle of conditional probability to justify basing our inferences on the posterior. 

There have been numerous difficulties associated with the use of improper priors, 
which is perhaps not surprising. In particular, it is important to note that there is no 
reason in general for the posterior of θ to exist as a proper probability distribution
when π is improper. If an improper prior is being used, then we should always check
to make sure the posterior is proper, as inferences will not make sense if we are using 
an improper posterior. 


When using an improper prior π, it is completely equivalent to instead use the prior
cπ for any c > 0, for the posterior under π is proper if and only if the posterior under
cπ is proper; then the posteriors are identical (see Exercise 7.4.6).

The following example illustrates the use of an improper prior. 


EXAMPLE 7.4.5 Location Normal Model with an Improper Prior

Suppose that (x1, ..., xn) is a sample from an N(μ, σ0²) distribution, where μ ∈ Ω =
R¹ is unknown and σ0² is known. Many arguments for default priors in this context lead
to the choice π(μ) = 1, which is clearly improper.

Proceeding as in Example 7.1.2, namely, pretending that this π is a proper proba-
bility density, we get that the posterior density of μ is proportional to

\[
\exp\left(-\frac{n}{2\sigma_0^2}(\bar{x}-\mu)^2\right).
\]

This immediately implies that the posterior distribution of μ is N(x̄, σ0²/n). Note
that this is the same as the limiting posterior obtained in Example 7.1.2 as τ0 → ∞,
although the point of view is quite different. ■
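Since the posterior under the flat prior is normal with mean x̄ and variance σ0²/n, posterior summaries follow directly. The Python sketch below computes a 0.95-credible interval from its quantiles; the data, the value of σ0, and the function name are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean


def flat_prior_posterior(data, sigma0):
    """Posterior N(xbar, sigma0^2 / n) for mu under the improper prior pi(mu) = 1."""
    n = len(data)
    return NormalDist(fmean(data), sigma0 / sqrt(n))


# Hypothetical data and known standard deviation, for illustration only.
data = [4.1, 5.3, 3.8, 4.9, 5.0, 4.4]
posterior = flat_prior_posterior(data, sigma0=1.0)
lower, upper = posterior.inv_cdf(0.025), posterior.inv_cdf(0.975)
print(round(posterior.mean, 4), round(lower, 4), round(upper, 4))
```

The resulting interval is symmetric about the sample mean, with half-width equal to the standard normal 0.975-quantile times σ0/√n.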


One commonly used method of selecting a default prior is to use, when it is avail-
able, the prior given by I^{1/2}(θ) when θ ∈ R¹ (and by (det I(θ))^{1/2} in the multidimen-
sional case), where I is the Fisher information for the statistical model as defined in
Section 6.5. This is referred to as Jeffreys’ prior. Note that Jeffreys’ prior is dependent
on the model.

Jeffreys’ prior has an important invariance property. From Challenge 6.5.19, we
have that, under some regularity conditions, if we make a 1-1 transformation of the
real-valued parameter θ via ψ = Ψ(θ), then the Fisher information of ψ is given by

\[
I\left(\Psi^{-1}(\psi)\right)\left(\left(\Psi^{-1}\right)'(\psi)\right)^{2}.
\]

Therefore, the default Jeffreys’ prior for ψ is

\[
I^{1/2}\left(\Psi^{-1}(\psi)\right)\left|\left(\Psi^{-1}\right)'(\psi)\right|. \tag{7.4.1}
\]

Now we see that, if we had started with the default prior I^{1/2}(θ) for θ and made the
change of variable to ψ, then this prior transforms to (7.4.1) by Theorems 2.6.2 and
2.6.3. A similar result can be obtained when θ is multidimensional.
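As a worked check of this invariance, consider a model not treated in this passage and used here only as an assumed illustration: a single observation from an Exponential distribution with density θe^{−θx}, θ > 0, for which the Fisher information is the standard result I(θ) = 1/θ². Take the transformation ψ = Ψ(θ) = 1/θ. Computing Jeffreys' prior directly in the ψ parameterization and transforming the Jeffreys' prior for θ by the change of variable give the same answer:

```latex
\begin{align*}
\Psi^{-1}(\psi) &= 1/\psi, \qquad \left(\Psi^{-1}\right)'(\psi) = -1/\psi^{2},\\
I\left(\Psi^{-1}(\psi)\right)\left(\left(\Psi^{-1}\right)'(\psi)\right)^{2}
  &= \psi^{2}\cdot\psi^{-4} = \psi^{-2}
  \quad\Longrightarrow\quad \text{Jeffreys' prior for } \psi \text{ is } \left(\psi^{-2}\right)^{1/2} = 1/\psi,\\
I^{1/2}\left(\Psi^{-1}(\psi)\right)\left|\left(\Psi^{-1}\right)'(\psi)\right|
  &= \psi\cdot\psi^{-2} = 1/\psi .
\end{align*}
```

Both routes give a prior proportional to 1/ψ, which is improper on (0, ∞).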

Jeffreys’ prior often turns out to be improper, as the next example illustrates. 


EXAMPLE 7.4.6 Location Normal (Example 7.4.5 continued) 

In this case, Jeffreys’ prior is given by π(θ) = √n/σ0, which gives the same posterior
as in Example 7.4.5. Note that Jeffreys’ prior is effectively a constant and hence the
prior of Example 7.4.5 is equivalent to Jeffreys’ prior. ■


Research into rules for determining noninformative priors and the consequences of 
using such priors is an active area in statistics. While the impropriety seems counterintuitive,
the use of such priors often produces inferences with good properties.


Summary of Section 7.4 


• To implement Bayesian inference, the statistician must choose a prior as well as
the sampling model for the data.

• These choices must be checked if the inferences obtained are supposed to have
practical validity. This topic is discussed in Chapter 9.

• Various techniques have been devised to allow for automatic selection of a prior.
These include empirical Bayes methods, hierarchical Bayes, and the use of
noninformative priors to express ignorance.

• Noninformative priors are often improper. We must always check that an
improper prior leads to a proper posterior.


EXERCISES 


7.4.1 Prove that the family {Gamma(α, β) : α > 0, β > 0} is a conjugate family of
priors with respect to sampling from the model given by Pareto(λ) distributions with
λ > 0.


7.4.2 Prove that the family {π_{α,β}(θ) : α > 1, β > 0} of priors given by

\[
\pi_{\alpha,\beta}(\theta) = (\alpha-1)\,\beta^{\alpha-1}\,\theta^{-\alpha}\,I_{[\beta,\infty)}(\theta)
\]

is a conjugate family of priors with respect to sampling from the model given by the
Uniform[0, θ] distributions with θ > 0.

7.4.3 Suppose that the statistical model is given by

          f_θ(1)   f_θ(2)   f_θ(3)   f_θ(4)
  θ = a    1/3      1/6      1/3      1/6
  θ = b    1/2      1/4      1/8      1/8

and that we consider the family of priors given by

          π_τ(a)   π_τ(b)
  τ = 1    1/2      1/2
  τ = 2    1/3      2/3

and we observe the sample x1 = 1, x2 = 1, x3 = 3.

(a) If we use the maximum value of the prior predictive for the data to determine the
value of τ, and hence the prior, which prior is selected here?

(b) Determine the posterior of θ based on the selected prior.

7.4.4 For the situation described in Exercise 7.4.3, put a uniform prior on the hyperpa-
rameter τ and determine the posterior of θ. (Hint: Theorem of total probability.)

7.4.5 For the model for proportions described in Example 7.1.1, determine the prior
predictive density. If n = 10 and nx̄ = 7, which of the priors given by (α, β) = (1, 1)
or (α, β) = (5, 5) would the prior predictive criterion select for further inferences
about θ?


7.4.6 Prove that when using an improper prior π, the posterior under π is proper if and
only if the posterior under cπ is proper for c > 0, and then the posteriors are identical.

7.4.7 Determine Jeffreys’ prior for the Bernoulli(θ) model and determine the posterior
distribution of θ based on this prior.

7.4.8 Suppose we are sampling from a Uniform[0, θ], θ > 0 model and we want to
use the improper prior π(θ) = 1.

(a) Does the posterior exist in this context? 

(b) Does Jeffreys’ prior exist in this context? 

7.4.9 Suppose a student wants to put a prior on the mean grade out of 100 that their 
class will obtain on the next statistics exam. The student feels that a normal prior 
centered at 66 is appropriate and that the interval (40, 92) should contain 99% of the 
marks. Fully identify the prior. 

7.4.10 A lab has conducted many measurements in the past on water samples from 
a particular source to determine the existence of a certain contaminant. From their 
records, it was determined that 50% of the samples had contamination less than 5.3 
parts per million, while 95% had contamination less than 7.3 parts per million. If a 
normal prior is going to be used for a future analysis, what prior do these data deter- 
mine? 

7.4.11 Suppose that a manufacturer wants to construct a 0.95-credible interval for the
mean lifetime θ of an item sold by the company. A consulting engineer is 99% certain
that the mean lifetime is less than 50 months. If the prior on θ is an Exponential(λ),
then determine λ based on this information.

7.4.12 Suppose the prior on a model parameter μ is taken to be N(μ0, σ0²), where μ0
and σ0² are hyperparameters. The statistician is able to elicit a value for μ0 but feels
unable to do this for σ0². Accordingly, the statistician puts a hyperprior on σ0² given by
1/σ0² ~ Gamma(α0, 1) for some value of α0. Determine the prior on μ. (Hint: Write
μ = μ0 + σ0 z, where z ~ N(0, 1).)


COMPUTER EXERCISES 


7.4.13 Consider the situation discussed in Exercise 7.4.5. 

(a) If we observe n = 10, nx̄ = 7, and we are using a symmetric prior, i.e., α = β, plot
the prior predictive as a function of α in the range (0, 20) (you will need a statistical
package that provides evaluations of the gamma function for this). Does this graph
clearly select a value for α?

(b) If we observe n = 10, nx̄ = 9, plot the prior predictive as a function of α in the
range (0, 20). Compare this plot with that in part (a).

7.4.14 Reproduce the plot given in Example 7.4.3 and verify that the maximum occurs
near λ = 2.3.


PROBLEMS 


7.4.15 Show that a distribution in the family {N(μ0, σ0²) : μ0 ∈ R¹, σ0 > 0} is
completely determined once we specify two quantiles of the distribution.


7.4.16 (Scale normal model) Consider the family of N(μ0, σ²) distributions, where
μ0 is known and σ² > 0 is unknown. Determine Jeffreys’ prior for this model.

7.4.17 Suppose that for the location-scale normal model described in Example 7.1.4,
we use the prior formed by the Jeffreys’ prior for the location model (just a constant)
times the Jeffreys’ prior for the scale normal model. Determine the posterior distribu-
tion of (μ, σ²).

7.4.18 Consider the location normal model described in Example 7.1.2.

(a) Determine the prior predictive density m. (Hint: Write down the joint density of
the sample and μ. Use (7.1.2) to integrate out μ and do not worry about getting m into
a recognizable form.)

(b) How would you generate a value (X1, ..., Xn) from this distribution?

(c) Are X1, ..., Xn mutually independent? Justify your answer. (Hint: Write X_i =
μ + σ0 Z_i, μ = μ0 + τ0 Z, where Z, Z1, ..., Zn are i.i.d. N(0, 1).)

7.4.19 Consider Example 7.3.2, but this time use the prior π(μ, σ²) = 1/σ².
Develop the Gibbs sampling algorithm for this situation. (Hint: Simply adjust each full
conditional in Example 7.3.2 appropriately.)


COMPUTER PROBLEMS 


7.4.20 Use the formulation described in Problem 7.4.17 and the data in the following
table

2.6 4.2 3.1 5.2 3.7 3.8 5.6 1.8 5.3 4.0
3.0 4.0 4.1 3.2 2.2 3.4 4.5 2.9 4.7 5.2

to generate a sample of size N = 10^4 from the posterior. Plot a density histogram estimate
of the posterior density of ψ = σ/μ based on this sample.


CHALLENGES 


7.4.21 When θ = (θ1, θ2), the Fisher information matrix I(θ1, θ2) is defined in
Problem 6.5.15. The Jeffreys’ prior is then defined as (det I(θ1, θ2))^{1/2}. Determine
Jeffreys’ prior for the location-scale normal model and compare this with the prior used
in Problem 7.4.17.


DISCUSSION TOPICS 


7.4.22 Using empirical Bayes methods to determine a prior violates the Bayesian prin- 
ciple that all unknowns should be assigned probability distributions. Comment on this. 
Is the hierarchical Bayesian approach a solution to this problem? 


7.5 Further Proofs (Advanced)


Derivation of the Posterior Distribution for the Location-Scale 
Normal Model 


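In Example 7.1.4, the likelihood function is given by

\[
L(\theta \mid x_1,\ldots,x_n) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{n}{2\sigma^2}(\bar{x}-\mu)^2\right)\exp\left(-\frac{n-1}{2\sigma^2}s^2\right).
\]

The prior on (μ, σ²) is given by μ | σ² ~ N(μ0, τ0²σ²) and 1/σ² ~ Gamma(α0, β0),
where μ0, τ0², α0 and β0 are fixed and known.
The posterior density of (μ, σ^{-2}) is then proportional to the likelihood times the
joint prior. Therefore, retaining only those parts of the likelihood and the prior that
depend on μ and σ², the joint posterior density is proportional to

\[
\begin{aligned}
&\left\{\left(\frac{1}{\sigma^2}\right)^{n/2}\exp\left(-\frac{n}{2\sigma^2}(\bar{x}-\mu)^2\right)\exp\left(-\frac{n-1}{2\sigma^2}s^2\right)\right\}
\left\{\left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left(-\frac{1}{2\tau_0^2\sigma^2}(\mu-\mu_0)^2\right)\right\}
\left\{\left(\frac{1}{\sigma^2}\right)^{\alpha_0-1}\exp\left(-\frac{\beta_0}{\sigma^2}\right)\right\} \\
&\quad= \exp\left(-\frac{1}{2\sigma^2}\left[\left(n+\frac{1}{\tau_0^2}\right)\mu^2-2\left(n\bar{x}+\frac{\mu_0}{\tau_0^2}\right)\mu\right]\right) \\
&\qquad\times\left(\frac{1}{\sigma^2}\right)^{\alpha_0+n/2-1/2}\exp\left(-\left[\beta_0+\frac{n}{2}\bar{x}^2+\frac{1}{2\tau_0^2}\mu_0^2+\frac{n-1}{2}s^2\right]\frac{1}{\sigma^2}\right) \\
&\quad= \left(\frac{1}{\sigma^2}\right)^{1/2}\exp\left(-\frac{1}{2\sigma^2}\left(n+\frac{1}{\tau_0^2}\right)\left(\mu-\left(n+\frac{1}{\tau_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2}+n\bar{x}\right)\right)^2\right) \\
&\qquad\times\left(\frac{1}{\sigma^2}\right)^{\alpha_0+n/2-1}\exp\left(-\left[\beta_0+\frac{n}{2}\bar{x}^2+\frac{1}{2\tau_0^2}\mu_0^2+\frac{n-1}{2}s^2-\frac{1}{2}\left(n+\frac{1}{\tau_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2}+n\bar{x}\right)^2\right]\frac{1}{\sigma^2}\right).
\end{aligned}
\]

From this, we deduce that the posterior distribution of (μ, σ²) is given by

\[
\mu \mid \sigma^2, x \sim N\left(\mu_x, \left(n+\frac{1}{\tau_0^2}\right)^{-1}\sigma^2\right)
\]

and

\[
\frac{1}{\sigma^2} \,\Big|\, x \sim \mathrm{Gamma}(\alpha_0+n/2, \beta_x),
\]

where

\[
\mu_x = \left(n+\frac{1}{\tau_0^2}\right)^{-1}\left(\frac{\mu_0}{\tau_0^2}+n\bar{x}\right)
\]

and

\[
\beta_x = \beta_0+\frac{n-1}{2}s^2+\frac{1}{2}\,\frac{n(\bar{x}-\mu_0)^2}{1+n\tau_0^2}.
\]
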
2 2 1 +n? 


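Derivation of J(θ(ψ0, λ)) for the Location-Scale Normal


Here we have that

\[
\psi = \psi(\mu, \sigma^{-2}) = \frac{\sigma}{\mu} = \frac{1}{\mu}\left(\frac{1}{\sigma^2}\right)^{-1/2}
\]

and

\[
\lambda = \lambda(\mu, \sigma^{-2}) = \frac{1}{\sigma^2}.
\]

We have that

\[
\left|\det\left(\begin{array}{cc}
\dfrac{\partial\psi}{\partial\mu} & \dfrac{\partial\psi}{\partial(1/\sigma^2)} \\[2ex]
\dfrac{\partial\lambda}{\partial\mu} & \dfrac{\partial\lambda}{\partial(1/\sigma^2)}
\end{array}\right)\right|
= \left|\det\left(\begin{array}{cc}
-\mu^{-2}\left(\frac{1}{\sigma^2}\right)^{-1/2} & -\frac{1}{2}\mu^{-1}\left(\frac{1}{\sigma^2}\right)^{-3/2} \\
0 & 1
\end{array}\right)\right|
= \left|\det\left(\begin{array}{cc}
-\psi^2\lambda^{1/2} & -\frac{1}{2}\psi\lambda^{-1} \\
0 & 1
\end{array}\right)\right|
= \psi^2\lambda^{1/2},
\]

and so

\[
J(\theta(\psi_0, \lambda)) = \left(\psi_0^2\lambda^{1/2}\right)^{-1}.
\]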


Chapter 8 
Optimal Inferences 


CHAPTER OUTLINE 


Section 1 Optimal Unbiased Estimation 
Section 2 Optimal Hypothesis Testing 
Section 3 Optimal Bayesian Inferences 
Section 4 Decision Theory (Advanced) 
Section 5 Further Proofs (Advanced) 


In Chapter 5, we introduced the basic ingredient of statistical inference — the statistical 
model. In Chapter 6, inference methods were developed based on the model alone via 
the likelihood function. In Chapter 7, we added the prior distribution on the model 
parameter, which led to the posterior distribution as the basis for deriving inference 
methods. 

With both the likelihood and the posterior, however, the inferences were derived 
largely based on intuition. For example, when we had a characteristic of interest ψ(θ),
there was nothing in the theory in Chapters 6 and 7 that forced us to choose a particular 
estimator, confidence or credible interval, or testing procedure. A complete theory of 
statistical inference, however, would totally prescribe our inferences. 

One attempt to resolve this issue is to introduce a performance measure on infer- 
ences and then choose an inference that does best with respect to this measure. For 
example, we might choose to measure the performance of estimators by their mean- 
squared error (MSE) and then try to obtain an estimator that had the smallest possible 
MSE. This is the optimality approach to inference, and it has been quite successful 
in a number of problems. In this chapter, we will consider several successes for the 
optimality approach to deriving inferences. 

Sometimes the performance measure we use can be considered to be based on 
what is called a loss function. Loss functions form the basis for yet another approach 
to statistical inference called decision theory. While it is not always the case that a 
performance measure is based on a loss function, this holds in some of the most impor- 
tant problems in statistical inference. Decision theory provides a general framework in 
which to discuss these problems. A brief introduction to decision theory is provided in 
Section 8.4 as an advanced topic. 




8.1 Optimal Unbiased Estimation


Suppose we want to estimate the real-valued characteristic $\psi(\theta)$ for the statistical
model $\{f_\theta : \theta \in \Omega\}$. If we have observed the data $s$, an estimate is a value $T(s)$ that the
statistician hopes will be close to the true value of $\psi(\theta)$. We refer to $T$ as an estimator
of $\psi$. The error in the estimate is given by $|T(s) - \psi(\theta)|$. For a variety of reasons
(mostly to do with mathematics), it is more convenient to consider the squared error
$(T(s) - \psi(\theta))^2$.

Of course, we would like this squared error to be as small as possible. Because
we do not know the true value of $\theta$, this leads us to consider the distributions of the
squared error, when $s$ has distribution given by $f_\theta$, for each $\theta \in \Omega$. We would then
like to choose the estimator $T$ so that these distributions are as concentrated as possible
about 0. A convenient measure of the concentration of these distributions about 0 is
given by their means, or

$$\text{MSE}_\theta(T) = E_\theta((T - \psi(\theta))^2), \qquad (8.1.1)$$

called the mean-squared error (recall Definition 6.3.1).

An optimal estimator of $\psi(\theta)$ is then a $T$ that minimizes (8.1.1) for every $\theta \in \Omega$.
In other words, $T$ would be optimal if, for any other estimator $T^*$ defined on $S$, we
have that

$$\text{MSE}_\theta(T) \le \text{MSE}_\theta(T^*)$$

for each $\theta$. Unfortunately, it can be shown that, except in very artificial circumstances,
there is no such $T$, so we need to modify our optimization problem.

This modification takes the form of restricting the estimators $T$ that we will entertain
as possible choices for the inference. Consider an estimator $T$ such that $E_\theta(T)$
does not exist or is infinite. It can then be shown that (8.1.1) is infinite (see Challenge
8.1.26). So we will first restrict our search to those $T$ for which $E_\theta(T)$ is finite for
every $\theta$.

Further restrictions on the types of estimators that we consider make use of the
following result (recall also Theorem 6.3.1).


Theorem 8.1.1 If $T$ is such that $E(T^2)$ is finite, then

$$E((T - c)^2) = \text{Var}(T) + (E(T) - c)^2.$$

This is minimized by taking $c = E(T)$.


PROOF We have that

$$E((T - c)^2) = E((T - E(T) + E(T) - c)^2)$$
$$= E((T - E(T))^2) + 2E(T - E(T))(E(T) - c) + (E(T) - c)^2$$
$$= \text{Var}(T) + (E(T) - c)^2, \qquad (8.1.2)$$

because $E(T - E(T)) = E(T) - E(T) = 0$. As $(E(T) - c)^2 \ge 0$, and $\text{Var}(T)$ does
not depend on $c$, the value of (8.1.2) is minimized by taking $c = E(T)$. ■
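
The decomposition in Theorem 8.1.1 can be checked exactly on a small discrete distribution. The following is a minimal sketch; the distribution `dist_T` and the helper names `mean`, `variance`, and `mse_about` are illustrative assumptions, not part of the text.

```python
from fractions import Fraction

def mean(dist, g=lambda t: t):
    """Expectation of g(T) when T has the probability function `dist`."""
    return sum(p * g(t) for t, p in dist.items())

def variance(dist):
    """Var(T) = E((T - E(T))^2)."""
    m = mean(dist)
    return mean(dist, lambda t: (t - m) ** 2)

def mse_about(dist, c):
    """E((T - c)^2), the mean-squared distance of T from the constant c."""
    return mean(dist, lambda t: (t - c) ** 2)

# Hypothetical distribution of T: value -> probability (exact arithmetic).
dist_T = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}

m = mean(dist_T)
for c in (Fraction(0), Fraction(1), m, Fraction(3)):
    # Theorem 8.1.1: E((T - c)^2) = Var(T) + (E(T) - c)^2 for every c.
    assert mse_about(dist_T, c) == variance(dist_T) + (m - c) ** 2
    print(c, mse_about(dist_T, c))
```

Because the arithmetic is done with exact fractions, the identity holds with equality rather than up to rounding, and the printed values are smallest at $c = E(T)$.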




8.1.1 The Rao-Blackwell Theorem and Rao-Blackwellization


We will prove that, when we are looking for T to minimize (8.1.1), we can further 
restrict our attention to estimators T that depend on the data only through the value 
of a sufficient statistic. This simplifies our search, as sufficiency often results in a 
reduction of the dimension of the data (recall the discussion and examples in Section 
6.1.1). First, however, we need the following property of sufficiency. 


Theorem 8.1.2 A statistic $U$ is sufficient for a model if and only if the conditional
distribution of the data $s$ given $U = u$ is the same for every $\theta \in \Omega$.


PROOF See Section 8.5 for the proof of this result. ■


The implication of this result is that information in the data $s$ beyond the value of
$U(s) = u$ can tell us nothing about the true value of $\theta$, because this information comes
from a distribution that does not depend on the parameter. Notice that Theorem 8.1.2
is a characterization of sufficiency, alternative to that provided in Section 6.1.1.

Consider a simple example that illustrates the content of Theorem 8.1.2. 


EXAMPLE 8.1.1

Suppose that $S = \{1, 2, 3, 4\}$, $\Omega = \{a, b\}$, where the two probability distributions are
given by the following table.

               s = 1   s = 2   s = 3   s = 4
  θ = a         1/2     1/6     1/6     1/6
  θ = b         1/4     1/4     1/4     1/4

Then $L(\cdot \mid 2) = L(\cdot \mid 3) = L(\cdot \mid 4)$, and so $U : S \to \{0, 1\}$, given by $U(1) = 0$ and
$U(2) = U(3) = U(4) = 1$, is a sufficient statistic.

As we must have $s = 1$ when we observe $U(s) = 0$, the conditional distribution of
the response $s$, given $U(s) = 0$, is degenerate at 1 (i.e., all the probability mass is at
the point 1) for both $\theta = a$ and $\theta = b$. When $\theta = a$, the conditional distribution of the
response $s$, given $U(s) = 1$, places 1/3 of its mass at each of the points in $\{2, 3, 4\}$ and
similarly when $\theta = b$. So given $U(s) = 1$, the conditional distributions are as in the
following table.

               s = 1   s = 2   s = 3   s = 4
  θ = a          0      1/3     1/3     1/3
  θ = b          0      1/3     1/3     1/3

Thus, we see that indeed the conditional distributions are independent of $\theta$. ■
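
The conditional distributions in Example 8.1.1 can be recomputed mechanically. The sketch below assumes the probabilities 1/2, 1/6, 1/6, 1/6 for $\theta = a$ and 1/4 for each point when $\theta = b$; the names `f`, `U`, and `conditional_given_U` are illustrative choices, not notation from the text.

```python
from fractions import Fraction

# Probability functions of Example 8.1.1, indexed by theta and then by s.
f = {
    "a": {1: Fraction(1, 2), 2: Fraction(1, 6), 3: Fraction(1, 6), 4: Fraction(1, 6)},
    "b": {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 4), 4: Fraction(1, 4)},
}

def U(s):
    """The statistic of the example: U(1) = 0 and U(2) = U(3) = U(4) = 1."""
    return 0 if s == 1 else 1

def conditional_given_U(theta, u):
    """Conditional probability function of s given U(s) = u under f_theta."""
    p_u = sum(p for s, p in f[theta].items() if U(s) == u)  # P_theta(U = u)
    return {s: (p / p_u if U(s) == u else Fraction(0)) for s, p in f[theta].items()}

for u in (0, 1):
    # Sufficiency (Theorem 8.1.2): the conditional law is the same for both thetas.
    assert conditional_given_U("a", u) == conditional_given_U("b", u)
    print(u, conditional_given_U("a", u))
```

The same function applied to a statistic that is not sufficient would produce conditional distributions that differ between $\theta = a$ and $\theta = b$.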


We now combine Theorems 8.1.1 and 8.1.2 to show that we can restrict our attention
to estimators $T$ that depend on the data only through the value of a sufficient
statistic $U$. By Theorem 8.1.2 we can denote the conditional probability measure for
$s$, given $U(s) = u$, by $P(\cdot \mid U = u)$, i.e., this probability measure does not depend on
$\theta$, as it is the same for every $\theta \in \Omega$.

For estimator $T$ of $\psi(\theta)$, such that $E_\theta(T)$ is finite for every $\theta$, put $T_U(s)$ equal to
the conditional expectation of $T$ given the value of $U(s)$, namely,

$$T_U(s) = E_{P(\cdot \mid U = U(s))}(T),$$

i.e., $T_U$ is the average value of $T$ when we average using $P(\cdot \mid U = U(s))$. Notice
that $T_U(s_1) = T_U(s_2)$ whenever $U(s_1) = U(s_2)$ (this is because $P(\cdot \mid U = U(s_1)) =
P(\cdot \mid U = U(s_2))$), and so $T_U$ depends on the data $s$ only through the value of $U(s)$.


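Theorem 8.1.3 (Rao-Blackwell) Suppose that $U$ is a sufficient statistic and $E_\theta(T^2)$
is finite for every $\theta$. Then $\text{MSE}_\theta(T_U) \le \text{MSE}_\theta(T)$ for every $\theta \in \Omega$.


PROOF Let $P_{\theta,U}$ denote the marginal probability measure of $U$ induced by $P_\theta$. By
the theorem of total expectation (see Theorem 3.5.2), we have that

$$\text{MSE}_\theta(T) = E_{P_{\theta,U}}\left(E_{P(\cdot \mid U = u)}((T - \psi(\theta))^2)\right),$$

where $E_{P(\cdot \mid U = u)}((T - \psi(\theta))^2)$ denotes the conditional MSE of $T$, given $U = u$. Now
by Theorem 8.1.1,

$$E_{P(\cdot \mid U = u)}((T - \psi(\theta))^2) = \text{Var}_{P(\cdot \mid U = u)}(T) + (E_{P(\cdot \mid U = u)}(T) - \psi(\theta))^2. \qquad (8.1.3)$$

As both terms in (8.1.3) are nonnegative, and recalling the definition of $T_U$, we have

$$\text{MSE}_\theta(T) = E_{P_{\theta,U}}(\text{Var}_{P(\cdot \mid U = u)}(T)) + E_{P_{\theta,U}}((T_U(s) - \psi(\theta))^2)
\ge E_{P_{\theta,U}}((T_U(s) - \psi(\theta))^2).$$

Now $(T_U(s) - \psi(\theta))^2 = E_{P(\cdot \mid U = u)}((T_U(s) - \psi(\theta))^2)$ (Theorem 3.5.4) and so, by
the theorem of total expectation,

$$E_{P_{\theta,U}}((T_U(s) - \psi(\theta))^2) = E_{P_{\theta,U}}\left(E_{P(\cdot \mid U = u)}((T_U(s) - \psi(\theta))^2)\right)
= E_{P_\theta}((T_U(s) - \psi(\theta))^2) = \text{MSE}_\theta(T_U)$$

and the theorem is proved. ■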


Theorem 8.1.3 shows that we can always improve on (or at least make no worse) 
any estimator T that possesses a finite second moment, by replacing T (s) by the esti- 
mate Ty(s). This process is sometimes referred to as the Rao-Blackwellization of an 
estimator. 
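
Rao-Blackwellization can be carried out by direct enumeration when the sample space is small. The sketch below is an illustration under assumed choices that do not come from the text: a Bernoulli($\theta$) sample of size $n = 3$, the crude estimator $T(s) = x_1$ of $\theta$, and the sufficient statistic $U(s) = x_1 + x_2 + x_3$. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

n = 3
samples = list(product((0, 1), repeat=n))  # all possible data sets s

def pmf(s, theta):
    """Probability of the sample s under Bernoulli(theta) sampling."""
    k = sum(s)
    return theta ** k * (1 - theta) ** (n - k)

def T(s):
    """Crude unbiased estimator of theta: the first observation only."""
    return s[0]

def U(s):
    """Sufficient statistic: the number of successes."""
    return sum(s)

def rao_blackwell(T, U, theta_ref=Fraction(1, 2)):
    """Return T_U(s) = E(T | U = U(s)). Since U is sufficient, the conditional
    distribution is free of theta, so any reference value theta_ref in (0, 1) works."""
    num, den = {}, {}
    for s in samples:
        p = pmf(s, theta_ref)
        num[U(s)] = num.get(U(s), 0) + p * T(s)
        den[U(s)] = den.get(U(s), 0) + p
    return lambda s: num[U(s)] / den[U(s)]

def mse(est, theta):
    """Exact MSE of the estimator `est` of theta."""
    return sum(pmf(s, theta) * (est(s) - theta) ** 2 for s in samples)

T_U = rao_blackwell(T, U)
for theta in (Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)):
    assert mse(T_U, theta) <= mse(T, theta)  # Theorem 8.1.3
    print(theta, mse(T, theta), mse(T_U, theta))
```

In this setup $T_U$ works out to the sample proportion, and its MSE is one third of the MSE of $T$ at every $\theta$ checked.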

Notice that putting $E = E_\theta$ and $c = \psi(\theta)$ in Theorem 8.1.1 implies that

$$\text{MSE}_\theta(T) = \text{Var}_\theta(T) + (E_\theta(T) - \psi(\theta))^2. \qquad (8.1.4)$$

So the MSE of $T$ can be decomposed as the sum of the variance of $T$ plus the squared
bias of $T$ (this was also proved in Theorem 6.3.1).

Theorem 8.1.1 has another important implication, for (8.1.4) is minimized by taking
$\psi(\theta) = E_\theta(T)$. This indicates that, on average, the estimator $T$ comes closer (in
terms of squared error) to $E_\theta(T)$ than to any other value. So, if we are sampling from
the distribution specified by $\theta$, $T(s)$ is a natural estimate of $E_\theta(T)$. Therefore, for a
general characteristic $\psi(\theta)$, it makes sense to restrict attention to estimators that have
bias equal to 0. This leads to the following definition.




Definition 8.1.1 An estimator $T$ of $\psi(\theta)$ is unbiased if $E_\theta(T) = \psi(\theta)$ for every
$\theta \in \Omega$.


Notice that, for unbiased estimators with finite second moment, (8.1.4) becomes

$$\text{MSE}_\theta(T) = \text{Var}_\theta(T).$$

Therefore, our search for an optimal estimator has become the search for an unbiased
estimator with smallest variance. If such an estimator exists, we give it a special name.


Definition 8.1.2 An unbiased estimator of $\psi(\theta)$ with smallest variance for each
$\theta \in \Omega$ is called a uniformly minimum variance unbiased (UMVU) estimator.


It is important to note that the Rao—Blackwell theorem (Theorem 8.1.3) also ap- 
plies to unbiased estimators. This is because the Rao—Blackwellization of an unbiased 
estimator yields an unbiased estimator, as the following result demonstrates. 


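Theorem 8.1.4 (Rao-Blackwell for unbiased estimators) If $T$ has finite second moment,
is unbiased for $\psi(\theta)$, and $U$ is a sufficient statistic, then $E_\theta(T_U) = \psi(\theta)$ for
every $\theta \in \Omega$ (so $T_U$ is also unbiased for $\psi(\theta)$) and $\text{Var}_\theta(T_U) \le \text{Var}_\theta(T)$.


PROOF Using the theorem of total expectation (Theorem 3.5.2), we have

$$E_\theta(T_U) = E_{P_{\theta,U}}(T_U) = E_{P_{\theta,U}}\left(E_{P(\cdot \mid U = u)}(T)\right) = E_\theta(T) = \psi(\theta).$$

So $T_U$ is unbiased for $\psi(\theta)$ and $\text{MSE}_\theta(T) = \text{Var}_\theta(T)$, $\text{MSE}_\theta(T_U) = \text{Var}_\theta(T_U)$.
Applying Theorem 8.1.3 gives $\text{Var}_\theta(T_U) \le \text{Var}_\theta(T)$. ■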


There are many situations in which the theory of unbiased estimation leads to good 
estimators. However, the following example illustrates that in some problems, there 
are no unbiased estimators and hence the theory has some limitations. 


EXAMPLE 8.1.2 The Nonexistence of an Unbiased Estimator
Suppose that $(x_1, \ldots, x_n)$ is a sample from the Bernoulli($\theta$) distribution and we wish to find a
UMVU estimator of $\psi(\theta) = \theta/(1 - \theta)$, the odds in favor of a success occurring. From
Theorem 8.1.4, we can restrict our search to unbiased estimators $T$ that are functions
of the sufficient statistic $n\bar{x}$.

Such a $T$ satisfies $E_\theta(T(n\bar{X})) = \theta/(1 - \theta)$ for every $\theta \in [0, 1]$. Recalling that
$n\bar{X} \sim \text{Binomial}(n, \theta)$, this implies that

$$\frac{\theta}{1 - \theta} = \sum_{k=0}^{n} T(k) \binom{n}{k} \theta^k (1 - \theta)^{n-k}$$

for every $\theta \in [0, 1]$. By the binomial theorem, we have

$$(1 - \theta)^{n-k} = \sum_{l=0}^{n-k} \binom{n-k}{l} (-1)^l \theta^l.$$

Substituting this into the preceding expression for $\theta/(1 - \theta)$ and writing this in terms
of powers of $\theta$ leads to

$$\frac{\theta}{1 - \theta} = \sum_{m=0}^{n} \left( \sum_{k=0}^{m} T(k) \binom{n}{k} \binom{n-k}{m-k} (-1)^{m-k} \right) \theta^m. \qquad (8.1.5)$$

Now the left-hand side of (8.1.5) goes to $\infty$ as $\theta \to 1$, but the right-hand side is a
polynomial in $\theta$, which is bounded in $[0, 1]$. Therefore, an unbiased estimator of $\psi$
cannot exist. ■


If a characteristic $\psi(\theta)$ has an unbiased estimator, then it is said to be U-estimable.
It should be kept in mind, however, that just because a parameter is not U-estimable
does not mean that we cannot estimate it! For example, $\psi$ in Example 8.1.2 is a 1-1
function of $\theta$, so the MLE of $\psi$ is given by $\bar{x}/(1 - \bar{x})$ (see Theorem 6.2.1); this seems
like a sensible estimator, even if it is biased.


8.1.2 Completeness and the Lehmann-Scheffé Theorem


In certain circumstances, if an unbiased estimator exists, and is a function of a sufficient 
statistic U, then there is only one such estimator — so it must be UMVU. We need the 
concept of completeness to establish this. 


Definition 8.1.3 A statistic $U$ is complete if any function $h$ of $U$, which satisfies
$E_\theta(h(U)) = 0$ for every $\theta \in \Omega$, also satisfies $h(U(s)) = 0$ with probability 1 for
each $\theta \in \Omega$ (i.e., $P_\theta(\{s : h(U(s)) = 0\}) = 1$ for every $\theta \in \Omega$).


In probability theory, we treat two functions as equivalent if they differ only on a set 
having probability content 0, as the probability of the functions taking different values 
at an observed response value is 0. So in Definition 8.1.3, we need not distinguish 
between h and the constant 0. Therefore, a statistic U is complete if the only unbiased 
estimator of 0, based on U, is given by 0 itself. 

We can now derive the following result. 


Theorem 8.1.5 (Lehmann-Scheffé) If $U$ is a complete sufficient statistic, and if $T$
depends on the data only through the value of $U$, has finite second moment for every
$\theta$, and is unbiased for $\psi(\theta)$, then $T$ is UMVU.


PROOF Suppose that $T^*$ is also an unbiased estimator of $\psi(\theta)$. By Theorem 8.1.4
we can assume that $T^*$ depends on the data only through the value of $U$. Then there
exist functions $h$ and $h^*$ such that $T(s) = h(U(s))$ and $T^*(s) = h^*(U(s))$ and

$$0 = E_\theta(T) - E_\theta(T^*) = E_\theta(h(U)) - E_\theta(h^*(U)) = E_\theta(h(U) - h^*(U)).$$

By the completeness of $U$, we have that $h(U) = h^*(U)$ with probability 1 for each
$\theta \in \Omega$, which implies that $T = T^*$ with probability 1 for each $\theta \in \Omega$. This says
there is essentially only one unbiased estimator for $\psi(\theta)$ based on $U$, and so it must be
UMVU. ■


The Rao-Blackwell theorem for unbiased estimators (Theorem 8.1.4), together
with the Lehmann-Scheffé theorem, provides a method for obtaining a UMVU estimator
of $\psi(\theta)$. Suppose we can find an unbiased estimator $T$ that has finite second
moment. If we also have a complete sufficient statistic $U$, then by Theorem 8.1.4
$T_U(s) = E_{P(\cdot \mid U = U(s))}(T)$ is unbiased for $\psi(\theta)$ and depends on the data only through
the value of $U$, because $T_U(s_1) = T_U(s_2)$ whenever $U(s_1) = U(s_2)$. Therefore, by
Theorem 8.1.5, $T_U$ is UMVU for $\psi(\theta)$.

It is not necessary, in a given problem, that a complete sufficient statistic exist. 
In fact, it can be proved that the only candidate for this is a minimal sufficient statistic 
(recall the definition in Section 6.1.1). So in a given problem, we must obtain a minimal 
sufficient statistic and then determine whether or not it is complete. We illustrate this 
via an example. 


EXAMPLE 8.1.3 Location Normal

Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in R^1$
is unknown and $\sigma_0^2 > 0$ is known. In Example 6.1.7, we showed that $\bar{x}$ is a minimal
sufficient statistic for this model.

In fact, $\bar{x}$ is also complete for this model. The proof of this is a bit involved and is
presented in Section 8.5.

Given that $\bar{x}$ is a complete, minimal sufficient statistic, this implies that $T(\bar{x})$ is a
UMVU estimator of its mean $E_\mu(T(\bar{X}))$ whenever $T$ has a finite second moment for
every $\mu \in R^1$. In particular, $\bar{x}$ is the UMVU estimator of $\mu$ because $E_\mu(\bar{X}) = \mu$ and
$E_\mu(\bar{X}^2) = (\sigma_0^2/n) + \mu^2 < \infty$. Furthermore, $\bar{x} + \sigma_0 z_p$ is the UMVU estimator of
$E_\mu(\bar{X} + \sigma_0 z_p) = \mu + \sigma_0 z_p$ (the $p$th quantile of the true distribution). ■


The arguments needed to show the completeness of a minimal sufficient statistic in 
a problem are often similar to the one required in Example 8.1.3 (see Challenge 8.1.27). 
Rather than pursue such technicalities here, we quote some important examples in 
which the minimal sufficient statistic is complete. 


EXAMPLE 8.1.4 Location-Scale Normal

Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where $\mu \in R^1$
and $\sigma > 0$ are unknown. The parameter in this model is two-dimensional and is given
by $(\mu, \sigma^2) \in R^1 \times (0, \infty)$.

We showed, in Example 6.1.8, that $(\bar{x}, s^2)$ is a minimal sufficient statistic for this
model. In fact, it can be shown that $(\bar{x}, s^2)$ is a complete minimal sufficient statistic.
Therefore, $T(\bar{x}, s^2)$ is a UMVU estimator of $E_\theta(T(\bar{X}, S^2))$ whenever the second moment
of $T(\bar{x}, s^2)$ is finite for every $(\mu, \sigma^2)$. In particular, $\bar{x}$ is the UMVU estimator of
$\mu$ and $s^2$ is UMVU for $\sigma^2$. ■


EXAMPLE 8.1.5 Distribution-Free Models
Suppose that $(x_1, \ldots, x_n)$ is a sample from some continuous distribution on $R^1$. The
statistical model comprises all continuous distributions on $R^1$.

It can be shown that the order statistics $(x_{(1)}, \ldots, x_{(n)})$ make up a complete minimal
sufficient statistic for this model. Therefore, $T(x_{(1)}, \ldots, x_{(n)})$ is UMVU for

$$E_\theta(T(X_{(1)}, \ldots, X_{(n)}))$$

whenever

$$E_\theta(T^2(X_{(1)}, \ldots, X_{(n)})) < \infty \qquad (8.1.6)$$

for every continuous distribution. In particular, if $T : R^n \to R^1$ is bounded, then this
is the case. For example, if

$$T(x_{(1)}, \ldots, x_{(n)}) = \frac{1}{n} \sum_{i=1}^{n} I_A(x_{(i)}),$$

the relative frequency of the event $A$ in the sample, then $T(x_{(1)}, \ldots, x_{(n)})$ is UMVU
for $E_\theta(T(X_{(1)}, \ldots, X_{(n)})) = P_\theta(A)$.

Now change the model assumption so that $(x_1, \ldots, x_n)$ is a sample from some
continuous distribution on $R^1$ that possesses its first $m$ moments. Again, it can be
shown that the order statistics make up a complete minimal sufficient statistic. Therefore,
$T(x_{(1)}, \ldots, x_{(n)})$ is UMVU for $E_\theta(T(X_{(1)}, \ldots, X_{(n)}))$ whenever (8.1.6) holds for
every continuous distribution possessing its first $m$ moments. For example, if $m = 2$,
then this implies that $T(x_{(1)}, \ldots, x_{(n)}) = \bar{x}$ is UMVU for $E(X)$. When $m = 4$, we
have that $s^2$ is UMVU for the population variance (see Exercise 8.1.2). ■


8.1.3 The Cramer-Rao Inequality (Advanced)


There is a fundamental inequality that holds for the variance of an estimator T. This is 
given by the Cramer—Rao inequality (sometimes called the information inequality). It 
is a corollary to the following inequality. 


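Theorem 8.1.6 (Covariance inequality) Suppose $T, U_\theta : S \to R^1$ and $E_\theta(T^2) < \infty$,
$0 < E_\theta(U_\theta^2) < \infty$ for every $\theta \in \Omega$. Then

$$\text{Var}_\theta(T) \ge \frac{(\text{Cov}_\theta(T, U_\theta))^2}{\text{Var}_\theta(U_\theta)}$$

for every $\theta \in \Omega$. Equality holds if and only if

$$T(s) = E_\theta(T) + \frac{\text{Cov}_\theta(T, U_\theta)}{\text{Var}_\theta(U_\theta)} \left(U_\theta(s) - E_\theta(U_\theta(s))\right)$$

with probability 1 for every $\theta \in \Omega$ (i.e., if and only if $T(s)$ and $U_\theta(s)$ are linearly
related).


PROOF This result follows immediately from the Cauchy-Schwarz inequality (Theorem
3.6.3). ■

Now suppose that $\Omega$ is an open subinterval of $R^1$ and we take

$$U_\theta(s) = S(\theta \mid s) = \frac{\partial \ln f_\theta(s)}{\partial \theta}, \qquad (8.1.7)$$

i.e., $U_\theta$ is the score function. Assume that the conditions discussed in Section 6.5 hold,
so that $E_\theta(S(\theta \mid s)) = 0$ for all $\theta$, and Fisher's information $I(\theta) = \text{Var}_\theta(S(\theta \mid s))$ is
finite. Then using

$$\frac{\partial \ln f_\theta(s)}{\partial \theta} = \frac{\partial f_\theta(s)}{\partial \theta} \frac{1}{f_\theta(s)},$$

we have

$$\text{Cov}_\theta(T, U_\theta) = E_\theta\left(T(s) \frac{\partial \ln f_\theta(s)}{\partial \theta}\right) = E_\theta\left(T(s) \frac{\partial f_\theta(s)}{\partial \theta} \frac{1}{f_\theta(s)}\right)$$
$$= \sum_s T(s) \frac{\partial f_\theta(s)}{\partial \theta} = \frac{\partial}{\partial \theta} \sum_s T(s) f_\theta(s) = \frac{\partial E_\theta(T)}{\partial \theta} \qquad (8.1.8)$$

in the discrete case, where we have assumed conditions like those discussed in Section
6.5, so we can pull the partial derivative through the sum. A similar argument gives the
equality (8.1.8) in the continuous case as well.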

The covariance inequality, applied with Ug specified as in (8.1.7) and using (8.1.8), 
gives the following result. 


Corollary 8.1.1 (Cramer-Rao or information inequality) Under conditions,

$$\text{Var}_\theta(T) \ge \left(\frac{\partial E_\theta(T)}{\partial \theta}\right)^2 (I(\theta))^{-1}$$

for every $\theta \in \Omega$. Equality holds if and only if

$$T(s) = E_\theta(T) + \frac{\partial E_\theta(T)}{\partial \theta} I^{-1}(\theta) S(\theta \mid s)$$

with probability 1 for every $\theta \in \Omega$.


The Cramer-Rao inequality provides a fundamental lower bound on the variance
of an estimator $T$. From (8.1.4), we know that the variance is a relevant measure of the
accuracy of an estimator only when the estimator is unbiased, so we restate Corollary
8.1.1 for this case.


Corollary 8.1.2 Under the conditions of Corollary 8.1.1, when $T$ is an unbiased
estimator of $\psi(\theta)$,

$$\text{Var}_\theta(T) \ge (\psi'(\theta))^2 I^{-1}(\theta)$$

for every $\theta \in \Omega$. Equality holds if and only if

$$T(s) = \psi(\theta) + \psi'(\theta) I^{-1}(\theta) S(\theta \mid s) \qquad (8.1.9)$$

with probability 1 for every $\theta \in \Omega$.


Notice that when $\psi(\theta) = \theta$, then Corollary 8.1.2 says that the variance of the
unbiased estimator $T$ is bounded below by the reciprocal of the Fisher information.
More generally, when $\psi$ is a 1-1, smooth transformation, we have (using Challenge
6.5.19) that the variance of an unbiased $T$ is again bounded below by the reciprocal of
the Fisher information, but this time the model uses the parameterization in terms of
$\psi(\theta)$.

Corollary 8.1.2 has several interesting implications. First, if we obtain an unbiased
estimator $T$ with variance at the lower bound, then we know immediately that it is
UMVU. Second, we know that any unbiased estimator that achieves the lower bound
is of the form given in (8.1.9). Note that the right-hand side of (8.1.9) must be independent
of $\theta$ in order for this to be an estimator. If this is not the case, then there are no
UMVU estimators whose variance achieves the lower bound. The following example
demonstrates that there are cases in which UMVU estimators exist, but their variance
does not achieve the lower bound.


EXAMPLE 8.1.6 Poisson($\lambda$) Model

Suppose that $(x_1, \ldots, x_n)$ is a sample from the Poisson($\lambda$) distribution where $\lambda > 0$ is
unknown. The log-likelihood is given by $l(\lambda \mid x_1, \ldots, x_n) = n\bar{x} \ln \lambda - n\lambda$, so the score
function is given by $S(\lambda \mid x_1, \ldots, x_n) = n\bar{x}/\lambda - n$. Now

$$\frac{\partial S(\lambda \mid x_1, \ldots, x_n)}{\partial \lambda} = -\frac{n\bar{x}}{\lambda^2},$$

and thus

$$I(\lambda) = E_\lambda\left(\frac{n\bar{X}}{\lambda^2}\right) = \frac{n}{\lambda}.$$

Suppose we are estimating $\lambda$. Then the Cramer-Rao lower bound is given by
$I^{-1}(\lambda) = \lambda/n$. Noting that $\bar{x}$ is unbiased for $\lambda$ and that $\text{Var}_\lambda(\bar{X}) = \lambda/n$, we see
immediately that $\bar{x}$ is UMVU and achieves the lower bound.

Now suppose that we are estimating $\psi(\lambda) = e^{-\lambda} = P_\lambda(\{0\})$. The Cramer-Rao
lower bound equals $\lambda e^{-2\lambda}/n$ and

$$\psi(\lambda) + \psi'(\lambda) I^{-1}(\lambda) S(\lambda \mid x_1, \ldots, x_n) = e^{-\lambda} - e^{-\lambda} \frac{\lambda}{n} \left(\frac{n\bar{x}}{\lambda} - n\right) = e^{-\lambda}(1 - \bar{x} + \lambda),$$

which is clearly not independent of $\lambda$. So there does not exist a UMVU estimator for
$\psi$ that attains the lower bound.

Does there exist a UMVU estimator for $\psi$? Observe that when $n = 1$, then $I_{\{0\}}(x_1)$
is an unbiased estimator of $\psi$. As it turns out, $\bar{x}$ is (for every $n$) a complete minimal
sufficient statistic for this model, so by the Lehmann-Scheffé theorem $I_{\{0\}}(x_1)$ is
UMVU for $\psi$. Furthermore, $I_{\{0\}}(X_1)$ has variance

$$P_\lambda(X_1 = 0)(1 - P_\lambda(X_1 = 0)) = e^{-\lambda}(1 - e^{-\lambda})$$

since $I_{\{0\}}(X_1) \sim \text{Bernoulli}(e^{-\lambda})$. This implies that $e^{-\lambda}(1 - e^{-\lambda}) > \lambda e^{-2\lambda}$.

In general, we have that

$$\frac{1}{n} \sum_{i=1}^{n} I_{\{0\}}(x_i)$$

is an unbiased estimator of $\psi$, but it is not a function of $\bar{x}$. Thus we cannot apply the
Lehmann-Scheffé theorem, but we can Rao-Blackwellize this estimator. Therefore,
the UMVU estimator of $\psi$ is given by

$$\frac{1}{n} \sum_{i=1}^{n} E(I_{\{0\}}(X_i) \mid \bar{X} = \bar{x}).$$

To determine this estimator in closed form, we reason as follows. The conditional
probability function of $(X_1, \ldots, X_n)$ given $\bar{X} = \bar{x}$, because $n\bar{X}$ is distributed
Poisson($n\lambda$), is

$$\left\{\frac{\lambda^{x_1}}{x_1!} \cdots \frac{\lambda^{x_n}}{x_n!} e^{-n\lambda}\right\} \left\{\frac{(n\lambda)^{n\bar{x}}}{(n\bar{x})!} e^{-n\lambda}\right\}^{-1} = \binom{n\bar{x}}{x_1 \, \ldots \, x_n} \left(\frac{1}{n}\right)^{x_1} \cdots \left(\frac{1}{n}\right)^{x_n},$$

i.e., $(X_1, \ldots, X_n)$ given $\bar{X} = \bar{x}$ is distributed Multinomial($n\bar{x}, 1/n, \ldots, 1/n$). Accordingly,
the UMVU estimator is given by

$$E(I_{\{0\}}(X_1) \mid \bar{X} = \bar{x}) = P(X_1 = 0 \mid \bar{X} = \bar{x}) = \left(1 - \frac{1}{n}\right)^{n\bar{x}}$$

because $X_i \mid \bar{X} = \bar{x} \sim \text{Binomial}(n\bar{x}, 1/n)$ for each $i = 1, \ldots, n$.

Certainly, it is not at all obvious from the functional form that this estimator is
unbiased, let alone UMVU. So this result can be viewed as a somewhat remarkable
application of the theory. ■
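
The unbiasedness of $(1 - 1/n)^{n\bar{x}}$ for $e^{-\lambda}$, and the fact that its variance stays strictly above the Cramer-Rao bound $\lambda e^{-2\lambda}/n$, can be checked numerically by summing over the Poisson($n\lambda$) distribution of $n\bar{X}$. The sketch below is only an illustration: the values $n = 5$ and $\lambda = 2$, the truncation point `k_max`, and the function names are assumptions made for the example.

```python
import math

def poisson_pmf(k, mean):
    """Poisson probability of k, computed on the log scale for stability."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def umvu_moments(n, lam, k_max=400):
    """Mean and variance of (1 - 1/n)^(n * xbar), where n * xbar ~ Poisson(n * lam).
    The infinite sums are truncated at k_max, which is an approximation."""
    r = 1 - 1 / n
    m1 = sum(poisson_pmf(k, n * lam) * r ** k for k in range(k_max))
    m2 = sum(poisson_pmf(k, n * lam) * r ** (2 * k) for k in range(k_max))
    return m1, m2 - m1 ** 2

n, lam = 5, 2.0
mean_est, var_est = umvu_moments(n, lam)
cr_bound = lam * math.exp(-2 * lam) / n  # Cramer-Rao bound for psi(lambda) = exp(-lambda)

print("E(estimator)     =", mean_est)
print("exp(-lambda)     =", math.exp(-lam))
print("Var(estimator)   =", var_est)
print("Cramer-Rao bound =", cr_bound)
```

For these values the computed mean agrees with $e^{-\lambda}$ up to rounding, while the variance exceeds the bound, consistent with the conclusion of Example 8.1.6 that the UMVU estimator exists but does not attain the information lower bound.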


Recall now Theorems 6.5.2 and 6.5.3. The implications of these results, with some
additional conditions, are that the MLE of $\theta$ is asymptotically unbiased for $\theta$ and that
the asymptotic variance of the MLE is at the information lower bound. This is often
interpreted to mean that, with large samples, the MLE makes full use of the information
about $\theta$ contained in the data.


Summary of Section 8.1 


• An estimator comes closest (using squared distance) on average to its mean (see
Theorem 8.1.1), so we can restrict attention to unbiased estimators for quantities
of interest.

• The Rao-Blackwell theorem says that we can restrict attention to functions of a
sufficient statistic when looking for an estimator minimizing MSE.

• When a sufficient statistic is complete, then any function of that sufficient statistic
is UMVU for its mean.

• The Cramer-Rao lower bound gives a lower bound on the variance of an unbiased
estimator and a method for obtaining an estimator that has variance at this
lower bound when such an estimator exists.




EXERCISES 


8.1.1 Suppose that a statistical model is given by the two distributions in the following
table.

               s = 1   s = 2   s = 3   s = 4
  $f_a(s)$      1/3     1/6     1/12    5/12
  $f_b(s)$      1/2     1/4     1/6     1/12

If $T : \{1, 2, 3, 4\} \to \{1, 2, 3, 4\}$ is defined by $T(1) = T(2) = 1$ and $T(s) = s$
otherwise, then prove that $T$ is a sufficient statistic. Derive the conditional distributions
of $s$ given $T(s)$ and show that these are independent of $\theta$.

8.1.2 Suppose that $(x_1, \ldots, x_n)$ is a sample from a distribution with mean $\mu$ and variance
$\sigma^2$. Prove that $s^2 = (n - 1)^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ is unbiased for $\sigma^2$.

8.1.3 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in R^1$
is unknown and $\sigma_0^2$ is known. Determine a UMVU estimator of the second moment
$\mu^2 + \sigma_0^2$.

8.1.4 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in R^1$
is unknown and $\sigma_0^2$ is known. Determine a UMVU estimator of the first quartile
$\mu + \sigma_0 z_{0.25}$.

8.1.5 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in R^1$
is unknown and $\sigma_0^2$ is known. Is $2\bar{x} + 3$ a UMVU estimator of anything? If so, what
is it UMVU for? Justify your answer.

8.1.6 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Bernoulli($\theta$) distribution, where
$\theta \in [0, 1]$ is unknown. Determine a UMVU estimator of $\theta$ (use the fact that a minimal
sufficient statistic for this model is complete).

8.1.7 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Gamma($\alpha_0, \beta$) distribution, where
$\alpha_0$ is known and $\beta > 0$ is unknown. Using the fact that $\bar{x}$ is a complete sufficient
statistic (see Challenge 8.1.27), determine a UMVU estimator of $\beta^{-1}$.

8.1.8 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu_0, \sigma^2)$ distribution, where $\mu_0$
is known and $\sigma^2 > 0$ is unknown. Show that $\sum_{i=1}^{n} (x_i - \mu_0)^2$ is a sufficient statistic
for this problem. Using the fact that it is complete, determine a UMVU estimator for
$\sigma^2$.

8.1.9 Suppose a statistical model comprises all continuous distributions on $R^1$. Based
on a sample of $n$, determine a UMVU estimator of $P((-1, 1))$, where $P$ is the true
probability measure. Justify your answer.

8.1.10 Suppose a statistical model comprises all continuous distributions on $R^1$ that
have a finite second moment. Based on a sample of $n$, determine a UMVU estimator
of $\mu^2$, where $\mu$ is the true mean. Justify your answer. (Hint: Find an unbiased estimator
for $n = 2$, Rao-Blackwellize this estimator for a sample of $n$, and then use the
Lehmann-Scheffé theorem.)

8.1.11 The estimator determined in Exercise 8.1.10 is also unbiased for $\mu^2$ when the
statistical model comprises all continuous distributions on $R^1$ that have a finite first
moment. Is this estimator still UMVU for $\mu^2$?




PROBLEMS 


8.1.12 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Uniform$[0, \theta]$ distribution, where
$\theta > 0$ is unknown. Show that $x_{(n)}$ is a sufficient statistic and determine its distribution.
Using the fact that $x_{(n)}$ is complete, determine a UMVU estimator of $\theta$.

8.1.13 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Bernoulli($\theta$) distribution, where
$\theta \in [0, 1]$ is unknown. Then determine the conditional distribution of $(x_1, \ldots, x_n)$,
given the value of the sufficient statistic $\bar{x}$.

8.1.14 Prove that $L(\theta, a) = (\theta - a)^2$ satisfies

$$L(\theta, \alpha a_1 + (1 - \alpha) a_2) \le \alpha L(\theta, a_1) + (1 - \alpha) L(\theta, a_2)$$

when $a$ ranges in a subinterval of $R^1$. Use this result together with Jensen's inequality
(Theorem 3.6.4) to prove the Rao-Blackwell theorem.

8.1.15 Prove that $L(\theta, a) = |\theta - a|$ satisfies

$$L(\theta, \alpha a_1 + (1 - \alpha) a_2) \le \alpha L(\theta, a_1) + (1 - \alpha) L(\theta, a_2)$$

when $a$ ranges in a subinterval of $R^1$. Use this result together with Jensen's inequality
(Theorem 3.6.4) to prove the Rao-Blackwell theorem for absolute error. (Hint: First
show that $|x + y| \le |x| + |y|$ for any $x$ and $y$.)

8.1.16 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where
$(\mu, \sigma^2) \in R^1 \times (0, \infty)$ is unknown. Show that the optimal estimator (in the sense
of minimizing the MSE), of the form $cs^2$ for $\sigma^2$, is given by $c = (n - 1)/(n + 1)$.
Determine the bias of this estimator and show that it goes to 0 as $n \to \infty$.

8.1.17 Prove that if a statistic $T$ is complete for a model and $U = h(T)$ for a 1-1
function $h$, then $U$ is also complete.

8.1.18 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where
$(\mu, \sigma^2) \in R^1 \times (0, \infty)$ is unknown. Derive a UMVU estimator of the standard deviation
$\sigma$. (Hint: Calculate the expected value of the sample standard deviation $s$.)

8.1.19 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where
$(\mu, \sigma^2) \in R^1 \times (0, \infty)$ is unknown. Derive a UMVU estimator of the first quartile
$\mu + \sigma z_{0.25}$. (Hint: Problem 8.1.17.)

8.1.20 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where
$\mu \in \Omega = \{\mu_1, \mu_2\}$ is unknown and $\sigma_0^2 > 0$ is known. Establish that $\bar{x}$ is a minimal
sufficient statistic for this model but that it is not complete.

8.1.21 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where
$\mu \in R^1$ is unknown and $\sigma_0^2$ is known. Determine the information lower bound, for an
unbiased estimator, when we consider estimating the second moment $\mu^2 + \sigma_0^2$. Does
the UMVU estimator in Exercise 8.1.3 attain the information lower bound?

8.1.22 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Gamma($\alpha_0, \beta$) distribution, where
$\alpha_0$ is known and $\beta > 0$ is unknown. Determine the information lower bound for the
estimation of $\beta^{-1}$ using unbiased estimators, and determine if the UMVU estimator
obtained in Exercise 8.1.7 attains this.

8.1.23 Suppose that $(x_1, \ldots, x_n)$ is a sample from the distribution with density
$f_\theta(x) = \theta x^{\theta - 1}$ for $x \in [0, 1]$ and $\theta > 0$ is unknown. Determine the information lower
bound for estimating $\theta$ using unbiased estimators. Does a UMVU estimator with variance
at the lower bound exist for this problem?

8.1.24 Suppose that a statistic $T$ is a complete statistic based on some statistical model.
A submodel is a statistical model that comprises only some of the distributions in the
original model. Why is it not necessarily the case that $T$ is complete for a submodel?

8.1.25 Suppose that a statistic $T$ is a complete statistic based on some statistical model.
If we construct a larger model that contains all the distributions in the original model
and is such that any set that has probability content equal to 0 for every distribution in
the original model also has probability content equal to 0 for every distribution in the
larger model, then prove that $T$ is complete for the larger model as well.


CHALLENGES


8.1.26 If $X$ is a random variable such that $E(X)$ either does not exist or is infinite, then
show that $E((X - c)^2) = \infty$ for any constant $c$.

8.1.27 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Gamma($\alpha_0, \beta$) distribution, where
$\alpha_0$ is known and $\beta > 0$ is unknown. Show that $\bar{x}$ is a complete minimal sufficient
statistic.


8.2 Optimal Hypothesis Testing


Suppose we want to assess a hypothesis about the real-valued characteristic $\psi(\theta)$ for
the model $\{f_\theta : \theta \in \Omega\}$. Typically, this will take the form $H_0 : \psi(\theta) = \psi_0$, where we
have specified a value for $\psi$. After observing data $s$, we want to assess whether or not
we have evidence against $H_0$.

In Section 6.3.3, we discussed methods for assessing such a hypothesis based on
the plug-in MLE for $\psi(\theta)$. These involved computing a P-value as a measure of how
surprising the data $s$ are when the null hypothesis is assumed to be true. If $s$ is surprising
for each of the distributions $f_\theta$ for which $\psi(\theta) = \psi_0$, then we have evidence
against $H_0$. The development of such procedures was largely based on the intuitive
justification for the likelihood function.


8.2.1 The Power Function of a Test


Closely associated with a specific procedure for computing a P-value is the concept
of a power function $\beta(\theta)$, as defined in Section 6.3.6. For this, we specified a critical
value $\alpha$, such that we declare the results of the test statistically significant whenever the
P-value is less than or equal to $\alpha$. The power $\beta(\theta)$ is then the probability of the P-value
being less than or equal to $\alpha$ when we are sampling from $f_\theta$. The greater the value
of $\beta(\theta)$, when $\psi(\theta) \ne \psi_0$, the better the procedure is at detecting departures from
$H_0$. The power function is thus a measure of the sensitivity of the testing procedure to
detecting departures from $H_0$.

Recall the following fundamental example.




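EXAMPLE 8.2.1 Location Normal Model
Suppose we have a sample $(x_1, \ldots, x_n)$ from the $N(\mu, \sigma_0^2)$ model, where $\mu \in R^1$ is
unknown and $\sigma_0^2 > 0$ is known, and we want to assess the null hypothesis $H_0 : \mu = \mu_0$.
In Example 6.3.9, we showed that a sensible test for this problem is based on the
z-statistic
\[ z = \frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}, \]
with $Z \sim N(0, 1)$ under $H_0$. The P-value is then given by
\[ P_{\mu_0}\left(|Z| \ge \left|\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right|\right) = 2\left[1 - \Phi\left(\left|\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right|\right)\right], \]
where $\Phi$ denotes the $N(0, 1)$ distribution function.
In Example 6.3.18, we showed that, for critical value $\alpha$, the power function of the
z-test is given by
\[
\begin{aligned}
\beta(\mu) &= P_\mu\left(2\left[1 - \Phi\left(\left|\frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}}\right|\right)\right] < \alpha\right)
 = P_\mu\left(\Phi\left(\left|\frac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}}\right|\right) > 1 - \frac{\alpha}{2}\right) \\
&= 1 - \Phi\left(\frac{\mu_0 - \mu}{\sigma_0/\sqrt{n}} + z_{1-\alpha/2}\right) + \Phi\left(\frac{\mu_0 - \mu}{\sigma_0/\sqrt{n}} - z_{1-\alpha/2}\right)
\end{aligned}
\]
because $\bar{X} \sim N(\mu, \sigma_0^2/n)$.
We see that specifying a value for $\alpha$ specifies a set of data values
\[ R = \left\{(x_1, \ldots, x_n) : \Phi\left(\left|\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right|\right) > 1 - \frac{\alpha}{2}\right\} \]
such that the results of the test are determined to be statistically significant whenever
$(x_1, \ldots, x_n) \in R$. Using the fact that $\Phi$ is 1-1 increasing, we can also write $R$ as
\[
\begin{aligned}
R &= \left\{(x_1, \ldots, x_n) : \left|\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right| > \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right\} \\
&= \left\{(x_1, \ldots, x_n) : \left|\frac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right| > z_{1-\alpha/2}\right\}.
\end{aligned}
\]
Furthermore, the power function is given by $\beta(\mu) = P_\mu(R)$ and $\beta(\mu_0) = P_{\mu_0}(R) = \alpha$. ∎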
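The power function of the two-sided z-test can be evaluated numerically. The following is a minimal sketch that uses only the Python standard library; the function name `two_sided_z_power` and the settings in the loop ($\mu_0 = 0$, $\sigma_0 = 1$, $n = 10$, $\alpha = 0.05$) are illustrative assumptions and are not taken from the text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # the N(0, 1) distribution


def two_sided_z_power(mu, mu0, sigma0, n, alpha):
    """Power of the two-sided z-test of H0: mu = mu0 at the true mean mu."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)      # z_{1 - alpha/2}
    shift = (mu0 - mu) / (sigma0 / sqrt(n))    # standardized distance from mu0
    return 1 - STD_NORMAL.cdf(shift + z) + STD_NORMAL.cdf(shift - z)


# Hypothetical settings chosen only for illustration.
for mu in (0.0, 0.5, 1.0):
    print(mu, round(two_sided_z_power(mu, 0.0, 1.0, 10, 0.05), 3))
```

At $\mu = \mu_0$ the function returns $\alpha$ (up to floating-point error), and the power grows as $\mu$ moves away from $\mu_0$ in either direction.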


8.2.2 | Type I and Type II Errors


We now adopt a different point of view. We are going to look for tests that are optimal
for testing the null hypothesis $H_0 : \psi(\theta) = \psi_0$. First, we will assume that, having
observed the data $s$, we will decide to either accept or reject $H_0$. If we reject $H_0$, then
this is equivalent to accepting the alternative $H_a : \psi(\theta) \neq \psi_0$. Our performance
measure for assessing testing procedures will then be the probability that the testing
procedure makes an error.




There are two types of error. We can make a type I error (rejecting $H_0$ when it is
true) or make a type II error (accepting $H_0$ when $H_0$ is false). Note that if we reject
$H_0$, then this implies that we are accepting the alternative hypothesis $H_a : \psi(\theta) \neq \psi_0$.

It turns out that, except in very artificial circumstances, there are no testing
procedures that simultaneously minimize the probabilities of making the two kinds of errors.
Accordingly, we will place an upper bound $\alpha$, called the critical value, on the
probability of making a type I error. We then search among those tests whose probability of
making a type I error is less than or equal to $\alpha$, for a testing procedure that minimizes
the probability of making a type II error.

Sometimes hypothesis testing problems for real-valued parameters are distinguished
as being one-sided or two-sided. For example, if $\theta$ is real-valued, then $H_0 : \theta = \theta_0$
versus $H_a : \theta \neq \theta_0$ is a two-sided testing problem, while $H_0 : \theta \le \theta_0$ versus $H_a : \theta > \theta_0$
or $H_0 : \theta \ge \theta_0$ versus $H_a : \theta < \theta_0$ are examples of one-sided problems. Notice,
however, that if we define
\[ \psi(\theta) = I_{(\theta_0, \infty)}(\theta), \]
then $H_0 : \theta \le \theta_0$ versus $H_a : \theta > \theta_0$ is equivalent to the problem $H_0 : \psi(\theta) = 0$
versus $H_a : \psi(\theta) \neq 0$. Similarly, if we define
\[ \psi(\theta) = I_{(-\infty, \theta_0)}(\theta), \]
then $H_0 : \theta \ge \theta_0$ versus $H_a : \theta < \theta_0$ is equivalent to the problem $H_0 : \psi(\theta) = 0$
versus $H_a : \psi(\theta) \neq 0$. So the formulation we have adopted for testing problems about
a general $\psi$ includes the one-sided problems as special cases.


8.2.3 | Rejection Regions and Test Functions 


One approach to specifying a testing procedure is to select a subset $R \subset S$ before we
observe $s$. We then reject $H_0$ whenever $s \in R$ and accept $H_0$ whenever $s \notin R$. The
set $R$ is referred to as a rejection region. Putting an upper bound on the probability of
rejecting $H_0$ when it is true leads to the following.


Definition 8.2.1 A rejection region $R$ satisfying
\[ P_\theta(R) \le \alpha \tag{8.2.1} \]
whenever $\psi(\theta) = \psi_0$ is called a size $\alpha$ rejection region for $H_0$.


So (8.2.1) expresses the bound on the probability of making a type I error.

Among all size $\alpha$ rejection regions $R$, we want to find the one (if it exists) that will
minimize the probability of making a type II error. This is equivalent to finding the
size $\alpha$ rejection region $R$ that maximizes the probability of rejecting the null hypothesis
when it is false. This probability can be expressed in terms of the power function of $R$
and is given by $\beta(\theta) = P_\theta(R)$ whenever $\psi(\theta) \neq \psi_0$.

To fully specify the optimality approach to testing hypotheses, we need one
additional ingredient. Observe that our search for an optimal size $\alpha$ rejection region $R$ is
equivalent to finding the indicator function $I_R$ that satisfies $\beta(\theta) = E_\theta(I_R) = P_\theta(R) \le \alpha$,
when $\psi(\theta) = \psi_0$, and maximizes $\beta(\theta) = E_\theta(I_R) = P_\theta(R)$, when $\psi(\theta) \neq \psi_0$. It
turns out that, in a number of problems, there is no such rejection region.

On the other hand, there is often a solution to the more general problem of finding
a function $\varphi : S \to [0, 1]$ satisfying
\[ \beta(\theta) = E_\theta(\varphi) \le \alpha, \tag{8.2.2} \]
when $\psi(\theta) = \psi_0$, and maximizes
\[ \beta(\theta) = E_\theta(\varphi), \]
when $\psi(\theta) \neq \psi_0$. We have the following terminology.


Definition 8.2.2 We call $\varphi : S \to [0, 1]$ a test function and $\beta(\theta) = E_\theta(\varphi)$ the
power function associated with the test function $\varphi$. If $\varphi$ satisfies (8.2.2) when
$\psi(\theta) = \psi_0$, it is called a size $\alpha$ test function. If $\varphi$ satisfies $E_\theta(\varphi) = \alpha$ when
$\psi(\theta) = \psi_0$, it is called an exact size $\alpha$ test function. A size $\alpha$ test function $\varphi$ that
maximizes $\beta(\theta) = E_\theta(\varphi)$ when $\psi(\theta) \neq \psi_0$ is called a uniformly most powerful
(UMP) size $\alpha$ test function.


Note that $\varphi = I_R$ is a test function with power function given by $\beta(\theta) = E_\theta(I_R) =
P_\theta(R)$.

For observed data $s$, we interpret $\varphi(s) = 0$ to mean that we accept $H_0$ and interpret
$\varphi(s) = 1$ to mean that we reject $H_0$. In general, we interpret $\varphi(s)$ to be the conditional
probability that we reject $H_0$ given the data $s$. Operationally, this means that, after we
observe $s$, we generate a Bernoulli$(\varphi(s))$ random variable. If we get a 1, we reject
$H_0$; if we get a 0, we accept $H_0$. Therefore, by the theorem of total expectation, $E_\theta(\varphi)$
is the unconditional probability of rejecting $H_0$. The randomization that occurs when
$0 < \varphi(s) < 1$ may seem somewhat counterintuitive, but it is forced on us by our search
for a UMP size $\alpha$ test, as we can increase power by doing this in certain problems.
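The operational meaning of a randomized test can be sketched in a few lines of Python. This is only an illustration: the helper name `reject_h0`, the value 0.3 used for the test function at the observed data, and the seed are assumptions made here.

```python
import random


def reject_h0(phi_s, rng=random):
    """Return True (reject H0) with probability phi_s, else False (accept H0)."""
    if not 0.0 <= phi_s <= 1.0:
        raise ValueError("a test function takes values in [0, 1]")
    # rng.random() is uniform on [0, 1), so this event has probability phi_s.
    return rng.random() < phi_s


rng = random.Random(2024)  # seeded only to make the illustration repeatable
decisions = [reject_h0(0.3, rng) for _ in range(10_000)]
print(sum(decisions) / len(decisions))  # fraction of rejections, close to 0.3
```

When the test function equals 0 or 1 at the observed data, the decision is deterministic, so randomization only matters for values strictly between 0 and 1.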


8.2.4 | The Neyman-—Pearson Theorem 


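For a testing problem specified by a null hypothesis $H_0 : \psi(\theta) = \psi_0$ and a critical
value $\alpha$, we want to find a UMP size $\alpha$ test function $\varphi$. Note that a UMP size $\alpha$ test
function $\varphi_0$ for $H_0 : \psi(\theta) = \psi_0$ is characterized (letting $\beta_\varphi$ denote the power function
of $\varphi$) by
\[ \beta_{\varphi_0}(\theta) \le \alpha, \]
when $\psi(\theta) = \psi_0$, and by
\[ \beta_{\varphi_0}(\theta) \ge \beta_\varphi(\theta), \]
when $\psi(\theta) \neq \psi_0$, for any other size $\alpha$ test function $\varphi$.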

Still, this optimization problem does not have a solution in general. In certain prob- 
lems, however, an optimal solution can be found. The following result gives one such 
example. It is fundamental to the entire theory of optimal hypothesis testing. 




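Theorem 8.2.1 (Neyman-Pearson) Suppose that $\Omega = \{\theta_0, \theta_1\}$ and that we want to
test $H_0 : \theta = \theta_0$. Then an exact size $\alpha$ test function $\varphi_0$ exists of the form
\[
\varphi_0(s) = \begin{cases}
1 & f_{\theta_1}(s)/f_{\theta_0}(s) > c_0 \\
\gamma & f_{\theta_1}(s)/f_{\theta_0}(s) = c_0 \\
0 & f_{\theta_1}(s)/f_{\theta_0}(s) < c_0
\end{cases}
\tag{8.2.3}
\]
for some $\gamma \in [0, 1]$ and $c_0 \ge 0$. This test is UMP size $\alpha$.


PROOF | See Section 8.5 for the proof of this result. ∎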


The following result can be established by a simple extension of the proof of the 
Neyman-Pearson theorem. 


Corollary 8.2.1 If $\varphi$ is a UMP size $\alpha$ test, then $\varphi(s) = \varphi_0(s)$ everywhere except
possibly on the boundary $B = \{s : f_{\theta_1}(s)/f_{\theta_0}(s) = c_0\}$. Furthermore, $\varphi$ has exact
size $\alpha$ unless the power of a UMP size $\alpha$ test equals 1.


PROOF | See Challenge 8.2.23. ∎


Notice the intuitive nature of the test given by the Neyman-Pearson theorem, for
(8.2.3) indicates that we categorically reject $H_0$ as being true when the likelihood ratio
of $\theta_1$ versus $\theta_0$ is greater than the constant $c_0$, and we accept $H_0$ when it is smaller.
When the likelihood ratio equals $c_0$, we randomly decide to reject $H_0$ with probability
$\gamma$. Also, Corollary 8.2.1 says that a UMP size $\alpha$ test is basically unique, although there
are possibly different randomization strategies on the boundary.

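The proof of the Neyman-Pearson theorem reveals that $c_0$ is the smallest real
number such that
\[ P_{\theta_0}\left(\frac{f_{\theta_1}(s)}{f_{\theta_0}(s)} > c_0\right) \le \alpha \tag{8.2.4} \]
and
\[
\gamma = \begin{cases}
\dfrac{\alpha - P_{\theta_0}\left(\frac{f_{\theta_1}(s)}{f_{\theta_0}(s)} > c_0\right)}{P_{\theta_0}\left(\frac{f_{\theta_1}(s)}{f_{\theta_0}(s)} = c_0\right)} & \text{if } P_{\theta_0}\left(\frac{f_{\theta_1}(s)}{f_{\theta_0}(s)} = c_0\right) \neq 0 \\
0 & \text{otherwise.}
\end{cases}
\tag{8.2.5}
\]
We use (8.2.4) and (8.2.5) to calculate $c_0$ and $\gamma$, and so determine the UMP size $\alpha$ test,
in a particular problem.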

Note that the test is nonrandomized whenever $P_{\theta_0}(f_{\theta_1}(s)/f_{\theta_0}(s) > c_0) = \alpha$, as
then $\gamma = 0$, i.e., we categorically accept or reject $H_0$ after seeing the data. This
always occurs whenever the distribution of $f_{\theta_1}(s)/f_{\theta_0}(s)$ is continuous when $s \sim P_{\theta_0}$.
Interestingly, it can happen that the distribution of the ratio is not continuous even when
the distribution of $s$ is continuous (see Problem 8.2.17).
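For a finite sample space, (8.2.4) and (8.2.5) can be applied mechanically. The sketch below is one way to do this with exact rational arithmetic; the function name `neyman_pearson` and the two distributions `f0` and `f1` on $\{1, 2, 3\}$ are hypothetical and are not drawn from the text. It assumes that every outcome has positive probability under $\theta_0$.

```python
from fractions import Fraction as F


def neyman_pearson(f0, f1, alpha):
    """Return (c0, gamma) from (8.2.4) and (8.2.5) for pmfs on a finite set.

    f0 and f1 map each outcome s to its probability; f0[s] > 0 is assumed.
    """
    ratio = {s: f1[s] / f0[s] for s in f0}

    def p0(event):  # P_{theta0}(likelihood ratio satisfies event)
        return sum(f0[s] for s in f0 if event(ratio[s]))

    # P0(ratio > c) is a step function of c that only drops at ratio values,
    # so the smallest admissible c is 0 or one of those values.
    for c0 in sorted({F(0), *ratio.values()}):
        if p0(lambda r: r > c0) <= alpha:
            break
    boundary = p0(lambda r: r == c0)
    gamma = (alpha - p0(lambda r: r > c0)) / boundary if boundary != 0 else F(0)
    return c0, gamma


# Hypothetical model on S = {1, 2, 3}.
f0 = {1: F(1, 2), 2: F(1, 4), 3: F(1, 4)}
f1 = {1: F(1, 4), 2: F(1, 4), 3: F(1, 2)}
c0, gamma = neyman_pearson(f0, f1, F(1, 10))

# Test function of the form (8.2.3), then its size and its power.
phi = {s: 1 if f1[s] / f0[s] > c0 else gamma if f1[s] / f0[s] == c0 else 0 for s in f0}
size = sum(f0[s] * phi[s] for s in f0)
power = sum(f1[s] * phi[s] for s in f0)
print(c0, gamma, size, power)  # prints: 2 2/5 1/10 1/5
```

In this made-up example the ratio has a discrete distribution under $\theta_0$, so the test must randomize on the boundary to attain size exactly 1/10.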

Before considering some applications of the Neyman-Pearson theorem, we
establish the analog of the Rao-Blackwell theorem for hypothesis testing problems. Given
the value of the sufficient statistic $U(s) = u$, we denote the conditional probability
measure for the response $s$ by $P(\cdot \mid U = u)$ (by Theorem 8.1.2, this probability
measure does not depend on $\theta$). For test function $\varphi$, put $\varphi_U(s)$ equal to the conditional
expectation of $\varphi$ given the value of $U(s)$, namely,
\[ \varphi_U(s) = E_{P(\cdot \mid U = U(s))}(\varphi). \]


Theorem 8.2.2 Suppose that $U$ is a sufficient statistic and $\varphi$ is a size $\alpha$ test function
for $H_0 : \psi(\theta) = \psi_0$. Then $\varphi_U$ is a size $\alpha$ test function for $H_0 : \psi(\theta) = \psi_0$ that
depends on the data only through the value of $U$. Furthermore, $\varphi$ and $\varphi_U$ have the
same power function.


PROOF | It is clear that $\varphi_U(s_1) = \varphi_U(s_2)$ whenever $U(s_1) = U(s_2)$, and so $\varphi_U$
depends on the data only through the value of $U$. Now let $P_{\theta, U}$ denote the marginal
probability measure of $U$ induced by $P_\theta$. Then by the theorem of total expectation, we
have $E_\theta(\varphi) = E_{P_{\theta, U}}(E_{P(\cdot \mid U = u)}(\varphi)) = E_{P_{\theta, U}}(\varphi_U) = E_\theta(\varphi_U)$. Now $E_\theta(\varphi) \le \alpha$ when
$\psi(\theta) = \psi_0$, which implies that $E_\theta(\varphi_U) \le \alpha$ when $\psi(\theta) = \psi_0$, and $\beta(\theta) = E_\theta(\varphi) =
E_\theta(\varphi_U)$ when $\psi(\theta) \neq \psi_0$. ∎

This result allows us to restrict our search for a UMP size $\alpha$ test to those test functions
that depend on the data only through the value of a sufficient statistic.

We now consider some applications of the Neyman—Pearson theorem. The follow- 
ing example shows that this result can lead to solutions to much more general problems 
than the simple case being addressed. 


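EXAMPLE 8.2.2 Optimal Hypothesis Testing in the Location Normal Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in \Omega =
\{\mu_0, \mu_1\}$ and $\sigma_0^2 > 0$ is known, and we want to test $H_0 : \mu = \mu_0$ versus $H_a : \mu = \mu_1$.
The likelihood function is given by
\[ L(\mu \mid x_1, \ldots, x_n) = \exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu)^2\right), \]
and $\bar{x}$ is a sufficient statistic for this restricted model.
By Theorem 8.2.2, we can restrict our attention to test functions that depend on the
data through $\bar{x}$. Now $\bar{X} \sim N(\mu, \sigma_0^2/n)$ so that
\[
\begin{aligned}
\frac{f_{\mu_1}(\bar{x})}{f_{\mu_0}(\bar{x})}
&= \frac{\exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu_1)^2\right)}{\exp\left(-\frac{n}{2\sigma_0^2}(\bar{x} - \mu_0)^2\right)} \\
&= \exp\left(-\frac{n}{2\sigma_0^2}\left(\bar{x}^2 - 2\bar{x}\mu_1 + \mu_1^2 - \bar{x}^2 + 2\bar{x}\mu_0 - \mu_0^2\right)\right) \\
&= \exp\left(\frac{n}{\sigma_0^2}(\mu_1 - \mu_0)\bar{x}\right)\exp\left(-\frac{n}{2\sigma_0^2}\left(\mu_1^2 - \mu_0^2\right)\right).
\end{aligned}
\]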



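Therefore,
\[
\begin{aligned}
P_{\mu_0}\left(\frac{f_{\mu_1}(\bar{X})}{f_{\mu_0}(\bar{X})} > c_0\right)
&= P_{\mu_0}\left(\exp\left(\frac{n}{\sigma_0^2}(\mu_1 - \mu_0)\bar{X}\right)\exp\left(-\frac{n}{2\sigma_0^2}\left(\mu_1^2 - \mu_0^2\right)\right) > c_0\right) \\
&= P_{\mu_0}\left(\exp\left(\frac{n}{\sigma_0^2}(\mu_1 - \mu_0)\bar{X}\right) > c_0\exp\left(\frac{n}{2\sigma_0^2}\left(\mu_1^2 - \mu_0^2\right)\right)\right) \\
&= P_{\mu_0}\left((\mu_1 - \mu_0)\bar{X} > \frac{\sigma_0^2}{n}\ln\left(c_0\exp\left(\frac{n}{2\sigma_0^2}\left(\mu_1^2 - \mu_0^2\right)\right)\right)\right) \\
&= \begin{cases}
P_{\mu_0}\left(\dfrac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}} > c_0'\right) & \mu_1 > \mu_0 \\[2ex]
P_{\mu_0}\left(\dfrac{\bar{X} - \mu_0}{\sigma_0/\sqrt{n}} < c_0'\right) & \mu_1 < \mu_0,
\end{cases}
\end{aligned}
\]
where
\[ c_0' = \frac{\sqrt{n}}{\sigma_0}\left(\frac{\sigma_0^2}{n(\mu_1 - \mu_0)}\ln\left(c_0\exp\left(\frac{n}{2\sigma_0^2}\left(\mu_1^2 - \mu_0^2\right)\right)\right) - \mu_0\right). \]
Using (8.2.4), when $\mu_1 > \mu_0$, we select $c_0$ so that $c_0' = z_{1-\alpha}$; when $\mu_1 < \mu_0$, we
select $c_0$ so that $c_0' = z_\alpha$. These choices imply that
\[ P_{\mu_0}\left(\frac{f_{\mu_1}(\bar{X})}{f_{\mu_0}(\bar{X})} > c_0\right) = \alpha \]
and, by (8.2.5), $\gamma = 0$.
So the UMP size $\alpha$ test is nonrandomized. When $\mu_1 > \mu_0$, the test is given by
\[
\varphi_0(\bar{x}) = \begin{cases}
1 & \bar{x} \ge \mu_0 + \frac{\sigma_0}{\sqrt{n}} z_{1-\alpha} \\
0 & \bar{x} < \mu_0 + \frac{\sigma_0}{\sqrt{n}} z_{1-\alpha}.
\end{cases}
\tag{8.2.6}
\]
When $\mu_1 < \mu_0$, the test is given by
\[
\varphi_0^*(\bar{x}) = \begin{cases}
1 & \bar{x} \le \mu_0 + \frac{\sigma_0}{\sqrt{n}} z_\alpha \\
0 & \bar{x} > \mu_0 + \frac{\sigma_0}{\sqrt{n}} z_\alpha.
\end{cases}
\tag{8.2.7}
\]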


Notice that the test function in (8.2.6) does not depend on $\mu_1$ in any way. The
subsequent implication is that this test function is UMP size $\alpha$ for $H_0 : \mu = \mu_0$ versus
$H_a : \mu = \mu_1$ for any $\mu_1 > \mu_0$. This implies that $\varphi_0$ is UMP size $\alpha$ for $H_0 : \mu = \mu_0$
versus the alternative $H_a : \mu > \mu_0$.




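Furthermore, we have
\[
\begin{aligned}
\beta_{\varphi_0}(\mu) &= P_\mu\left(\bar{X} \ge \mu_0 + \frac{\sigma_0}{\sqrt{n}} z_{1-\alpha}\right)
 = P_\mu\left(\frac{\bar{X} - \mu}{\sigma_0/\sqrt{n}} \ge \frac{\mu_0 - \mu}{\sigma_0/\sqrt{n}} + z_{1-\alpha}\right) \\
&= 1 - \Phi\left(\frac{\mu_0 - \mu}{\sigma_0/\sqrt{n}} + z_{1-\alpha}\right).
\end{aligned}
\]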
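The one-sided test (8.2.6) and its power function are easy to evaluate numerically. The following sketch uses only the Python standard library; the function names `phi0` and `power_phi0` and the numerical settings ($\mu_0 = 0$, $\sigma_0 = 1$, $n = 25$, $\alpha = 0.05$) are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # the N(0, 1) distribution


def phi0(xbar, mu0, sigma0, n, alpha):
    """Test (8.2.6): return 1 (reject H0) or 0 (accept H0)."""
    cutoff = mu0 + sigma0 / sqrt(n) * STD_NORMAL.inv_cdf(1 - alpha)
    return 1 if xbar >= cutoff else 0


def power_phi0(mu, mu0, sigma0, n, alpha):
    """Power 1 - Phi((mu0 - mu)/(sigma0/sqrt(n)) + z_{1-alpha}) at mean mu."""
    z = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf((mu0 - mu) / (sigma0 / sqrt(n)) + z)


# Hypothetical settings chosen only for illustration.
for mu in (-0.5, 0.0, 0.25, 0.5):
    print(mu, round(power_phi0(mu, 0.0, 1.0, 25, 0.05), 3))
```

The printed values increase with $\mu$ and equal $\alpha$ at $\mu = \mu_0$, so the probability of rejection stays below $\alpha$ for every $\mu < \mu_0$.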


Note that this is increasing in $\mu$, which implies that $\varphi_0$ is a size $\alpha$ test function for
$H_0 : \mu \le \mu_0$ versus $H_a : \mu > \mu_0$. Observe that, if $\varphi$ is a size $\alpha$ test function for
$H_0 : \mu \le \mu_0$ versus $H_a : \mu > \mu_0$, then it is also a size $\alpha$ test for $H_0 : \mu = \mu_0$ versus
$H_a : \mu > \mu_0$. From this, we conclude that $\varphi_0$ is UMP size $\alpha$ for $H_0 : \mu \le \mu_0$ versus
$H_a : \mu > \mu_0$. Similarly (see Problem 8.2.12), it can be shown that $\varphi_0^*$ in (8.2.7) is
UMP size $\alpha$ for $H_0 : \mu \ge \mu_0$ versus $H_a : \mu < \mu_0$.

We might wonder if a UMP size $\alpha$ test exists for the two-sided problem $H_0 : \mu =
\mu_0$ versus $H_a : \mu \neq \mu_0$. Suppose that $\varphi$ is a size $\alpha$ UMP test for this problem. Then $\varphi$
is also size $\alpha$ for $H_0 : \mu = \mu_0$ versus $H_a : \mu = \mu_1$ when $\mu_1 > \mu_0$. Using Corollary
8.2.1 and the preceding developments (which also show that there does not exist a test
of the form (8.2.3) having power equal to 1 for this problem), this implies that $\varphi = \varphi_0$
(the boundary $B$ has probability 0 here). But $\varphi$ is also UMP size $\alpha$ for $H_0 : \mu = \mu_0$
versus $H_a : \mu = \mu_1$ when $\mu_1 < \mu_0$; thus, by the same reasoning, $\varphi = \varphi_0^*$. But clearly
$\varphi_0 \neq \varphi_0^*$, so there is no UMP size $\alpha$ test for the two-sided problem.

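Intuitively, we would expect that the size $\alpha$ test given by
\[
\varphi(\bar{x}) = \begin{cases}
1 & \left|\dfrac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right| \ge z_{1-\alpha/2} \\[2ex]
0 & \left|\dfrac{\bar{x} - \mu_0}{\sigma_0/\sqrt{n}}\right| < z_{1-\alpha/2}
\end{cases}
\tag{8.2.8}
\]
would be a good test to use, but it is not UMP size $\alpha$. It turns out, however, that the test
in (8.2.8) is UMP size $\alpha$ among all tests satisfying $\beta_\varphi(\mu_0) \le \alpha$ and $\beta_\varphi(\mu) \ge \alpha$ when
$\mu \neq \mu_0$. ∎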

Example 8.2.2 illustrated a hypothesis testing problem for which no UMP size $\alpha$
test exists. Sometimes, however, by requiring that the test possess another very natural
property, we can obtain an optimal test.


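Definition 8.2.3 A test $\varphi$ that satisfies $\beta_\varphi(\theta) \le \alpha$, when $\psi(\theta) = \psi_0$, and $\beta_\varphi(\theta) \ge
\alpha$, when $\psi(\theta) \neq \psi_0$, is said to be an unbiased size $\alpha$ test for the hypothesis testing
problem $H_0 : \psi(\theta) = \psi_0$.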


So (8.2.8) is a UMP unbiased size $\alpha$ test. An unbiased test has the property that the
probability of rejecting the null hypothesis, when the null hypothesis is false, is always
at least as large as the probability of rejecting the null hypothesis, when the null hypothesis
is true. This seems like a very reasonable property. In particular, it can be proved that
any UMP size $\alpha$ test is always an unbiased size $\alpha$ test (Problem 8.2.14). We do not pursue
the theory of unbiased tests further in this text.

We now consider an example which shows that we cannot dispense with the use of 
randomized tests. 




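EXAMPLE 8.2.3 Optimal Hypothesis Testing in the Bernoulli($\theta$) Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from a Bernoulli$(\theta)$ distribution, where $\theta \in \Omega =
\{\theta_0, \theta_1\}$, and we want to test $H_0 : \theta = \theta_0$ versus $H_a : \theta = \theta_1$, where $\theta_1 > \theta_0$. Then
$n\bar{x}$ is a minimal sufficient statistic and, by Theorem 8.2.2, we can restrict our attention
to test functions that depend on the data only through $n\bar{x}$.

Now $n\bar{X} \sim$ Binomial$(n, \theta)$, so
\[ \frac{f_{\theta_1}(n\bar{x})}{f_{\theta_0}(n\bar{x})} = \frac{\theta_1^{n\bar{x}}(1 - \theta_1)^{n - n\bar{x}}}{\theta_0^{n\bar{x}}(1 - \theta_0)^{n - n\bar{x}}}. \]
Therefore,
\[
\begin{aligned}
P_{\theta_0}\left(\frac{f_{\theta_1}(n\bar{X})}{f_{\theta_0}(n\bar{X})} > c_0\right)
&= P_{\theta_0}\left(\left(\frac{\theta_1}{\theta_0}\right)^{n\bar{X}}\left(\frac{1 - \theta_1}{1 - \theta_0}\right)^{n - n\bar{X}} > c_0\right) \\
&= P_{\theta_0}\left(\left(\frac{\theta_1(1 - \theta_0)}{\theta_0(1 - \theta_1)}\right)^{n\bar{X}} > c_0\left(\frac{1 - \theta_1}{1 - \theta_0}\right)^{-n}\right) \\
&= P_{\theta_0}\left(n\bar{X}\ln\left(\frac{\theta_1(1 - \theta_0)}{\theta_0(1 - \theta_1)}\right) > \ln\left(c_0\left(\frac{1 - \theta_1}{1 - \theta_0}\right)^{-n}\right)\right) \\
&= P_{\theta_0}\left(n\bar{X} > \frac{\ln\left(c_0\left(\frac{1 - \theta_1}{1 - \theta_0}\right)^{-n}\right)}{\ln\left(\frac{\theta_1(1 - \theta_0)}{\theta_0(1 - \theta_1)}\right)}\right) = P_{\theta_0}\left(n\bar{X} > c_0'\right)
\end{aligned}
\]
because
\[ \ln\left(\frac{\theta_1(1 - \theta_0)}{\theta_0(1 - \theta_1)}\right) > 0, \]
as $\theta/(1 - \theta)$ is increasing in $\theta$, which implies $\theta_1/(1 - \theta_1) > \theta_0/(1 - \theta_0)$.
Now, using (8.2.4), we choose $c_0$ so that $c_0'$ is an integer satisfying
\[ P_{\theta_0}(n\bar{X} > c_0') \le \alpha \quad \text{and} \quad P_{\theta_0}(n\bar{X} > c_0' - 1) > \alpha. \]
Because $n\bar{X} \sim$ Binomial$(n, \theta_0)$ is a discrete distribution, we see that, in general, we
will not be able to achieve $P_{\theta_0}(n\bar{X} > c_0') = \alpha$ exactly. So, using (8.2.5),
\[ \gamma = \frac{\alpha - P_{\theta_0}(n\bar{X} > c_0')}{P_{\theta_0}(n\bar{X} = c_0')} \]
will not be equal to 0. Then
\[
\varphi_0(n\bar{x}) = \begin{cases}
1 & n\bar{x} > c_0' \\
\gamma & n\bar{x} = c_0' \\
0 & n\bar{x} < c_0'
\end{cases}
\]
is UMP size $\alpha$ for $H_0 : \theta = \theta_0$ versus $H_a : \theta = \theta_1$. Note that we can use statistical
software (or Table D.6) for the binomial distribution to obtain $c_0'$.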

For example, suppose $n = 6$ and $\theta_0 = 0.25$. The following table gives the values
of the Binomial$(6, 0.25)$ distribution function to three decimal places.


x    |   0     1     2     3     4     5     6
F(x) | 0.178 0.534 0.831 0.962 0.995 1.000 1.000


Therefore, if $\alpha = 0.05$, we have that $c_0' = 3$ because $P_{0.25}(n\bar{X} > 3) = 1 - 0.962 =
0.038$ and $P_{0.25}(n\bar{X} > 2) = 1 - 0.831 = 0.169$. Since $P_{0.25}(n\bar{X} = 3) = 0.962 - 0.831 =
0.131$, (8.2.5) implies that
\[ \gamma = \frac{0.05 - (1 - 0.962)}{0.962 - 0.831} = \frac{0.012}{0.131} \approx 0.092. \]
So with this test, we reject $H_0 : \theta = \theta_0$ categorically if the number of successes is
greater than 3, accept $H_0 : \theta = \theta_0$ categorically when the number of successes is less
than 3, and when the number of 1's equals 3, we randomly reject $H_0 : \theta = \theta_0$ with
probability 0.092 (e.g., generate $U \sim$ Uniform$[0, 1]$ and reject whenever $U \le 0.092$).
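The constants $c_0'$ and $\gamma$ for the binomial problem can be computed directly from (8.2.4) and (8.2.5). The sketch below is a minimal illustration that assumes $0 < \theta_0 < 1$ and $0 < \alpha < 1$; the function name `binomial_ump_test` is hypothetical. It works with unrounded binomial probabilities, so $\gamma$ can differ in the third decimal place from a value computed from a three-decimal table.

```python
from math import comb


def binomial_ump_test(n, theta0, alpha):
    """Return (c, gamma): reject if nX > c, reject w.p. gamma if nX == c."""
    pmf = [comb(n, k) * theta0**k * (1 - theta0) ** (n - k) for k in range(n + 1)]

    def upper_tail(c):  # P_{theta0}(nX > c)
        return sum(pmf[c + 1:])

    # Smallest integer c with P(nX > c) <= alpha, as required by (8.2.4).
    c = next(c for c in range(n + 1) if upper_tail(c) <= alpha)
    # (8.2.5): spread the remaining size over the boundary event nX == c.
    gamma = (alpha - upper_tail(c)) / pmf[c]
    return c, gamma


c, gamma = binomial_ump_test(6, 0.25, 0.05)
print(c, round(gamma, 3))  # prints: 3 0.094
```

By construction, $P_{\theta_0}(n\bar{X} > c) + \gamma P_{\theta_0}(n\bar{X} = c)$ equals $\alpha$, so the randomized test has exact size.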

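Notice that the test $\varphi_0$ does not involve $\theta_1$, so indeed it is UMP size $\alpha$ for $H_0 : \theta =
\theta_0$ versus $H_a : \theta > \theta_0$. Furthermore, using Problem 8.2.18, we have
\[
\begin{aligned}
P_\theta(n\bar{X} > c_0') &= \sum_{k = c_0' + 1}^{n} \binom{n}{k}\theta^k(1 - \theta)^{n - k} \\
&= 1 - \frac{\Gamma(n + 1)}{\Gamma(c_0' + 1)\Gamma(n - c_0')}\int_\theta^1 u^{c_0'}(1 - u)^{n - c_0' - 1}\, du.
\end{aligned}
\]
Because
\[ \int_\theta^1 u^{c_0'}(1 - u)^{n - c_0' - 1}\, du \]
is decreasing in $\theta$, we must have that $P_\theta(n\bar{X} > c_0')$ is increasing in $\theta$. Arguing as in
Example 8.2.2, we conclude that $\varphi_0$ is UMP size $\alpha$ for $H_0 : \theta \le \theta_0$ versus $H_a : \theta > \theta_0$.
Similarly, we obtain a UMP size $\alpha$ test for $H_0 : \theta \ge \theta_0$ versus $H_a : \theta < \theta_0$. As in
Example 8.2.2, there is no UMP size $\alpha$ test for $H_0 : \theta = \theta_0$ versus $H_a : \theta \neq \theta_0$, but
there is a UMP unbiased size $\alpha$ test for this problem. ∎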


8.2.5 | Likelihood Ratio Tests (Advanced) 


In the examples considered so far, the Neyman-Pearson theorem has led to solutions
to problems in which $H_0$ or $H_a$ are not just single values of the parameter, even though
the theorem was only stated for the single-value case. We also noted, however, that
this is not true in general (for example, the two-sided problems discussed in Examples
8.2.2 and 8.2.3).

The method of generalized likelihood ratio tests for $H_0 : \psi(\theta) = \psi_0$ has been
developed to deal with the general case. This is motivated by the Neyman-Pearson
theorem, for observe that in (8.2.3),
\[ \frac{f_{\theta_1}(s)}{f_{\theta_0}(s)} = \frac{L(\theta_1 \mid s)}{L(\theta_0 \mid s)}. \]
Therefore, (8.2.3) can be thought of as being based on the ratio of the likelihood at $\theta_1$
to the likelihood at $\theta_0$, and we reject $H_0 : \theta = \theta_0$ when the likelihood gives much more
support to $\theta_1$ than to $\theta_0$. The amount of the additional support required for rejection is
determined by $c_0$. The larger $c_0$ is, the larger the likelihood $L(\theta_1 \mid s)$ has to be relative
to $L(\theta_0 \mid s)$ before we reject $H_0 : \theta = \theta_0$.

Denote the overall MLE of $\theta$ by $\hat{\theta}(s)$, and the MLE, when $\theta \in H_0$, by $\hat{\theta}_{H_0}(s)$. So
we have
\[ L(\theta \mid s) \le L(\hat{\theta}_{H_0}(s) \mid s) \]
for all $\theta$ such that $\psi(\theta) = \psi_0$. The generalized likelihood ratio test then rejects $H_0$
when
\[ \frac{L(\hat{\theta}(s) \mid s)}{L(\hat{\theta}_{H_0}(s) \mid s)} \tag{8.2.9} \]
is large, as this indicates evidence against $H_0$ being true.
How do we determine when (8.2.9) is large enough to reject? Denoting the
observed data by $s_0$, we do this by computing the P-values
\[ P_\theta\left(\frac{L(\hat{\theta}(s) \mid s)}{L(\hat{\theta}_{H_0}(s) \mid s)} > \frac{L(\hat{\theta}(s_0) \mid s_0)}{L(\hat{\theta}_{H_0}(s_0) \mid s_0)}\right) \tag{8.2.10} \]
when $\theta \in H_0$. Small values of (8.2.10) are evidence against $H_0$. Of course, when
$\psi(\theta) = \psi_0$ for more than one value of $\theta$, then it is not clear which value of (8.2.10) to
use. It can be shown, however, that under conditions such as those discussed in Section
6.5, if $s$ corresponds to a sample of $n$ values from a distribution, then
\[ 2\ln\frac{L(\hat{\theta}(s) \mid s)}{L(\hat{\theta}_{H_0}(s) \mid s)} \xrightarrow{D} \chi^2(\dim\Omega - \dim H_0) \]
as $n \to \infty$, whenever the true value of $\theta$ is in $H_0$. Here, $\dim\Omega$ and $\dim H_0$ are the
dimensions of these sets. This leads us to a test that rejects $H_0$ whenever
\[ 2\ln\frac{L(\hat{\theta}(s) \mid s)}{L(\hat{\theta}_{H_0}(s) \mid s)} \tag{8.2.11} \]
is greater than a particular quantile of the $\chi^2(\dim\Omega - \dim H_0)$ distribution.

For example, suppose that in a location-scale normal model, we are testing $H_0 :
\mu = \mu_0$. Then $\Omega = R^1 \times [0, \infty)$, $H_0 = \{\mu_0\} \times [0, \infty)$, $\dim\Omega = 2$, $\dim H_0 = 1$, and,
for a size 0.05 test, we reject whenever (8.2.11) is greater than $\chi^2_{0.95}(1)$. Note that,
strictly speaking, likelihood ratio tests are not derived via optimality considerations.
We will not discuss likelihood ratio tests further in this text.
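As an illustration of the location-scale normal case, the statistic (8.2.11) has a closed form: maximizing the normal likelihood over the variance gives $2\ln$ of the ratio equal to $n\ln(\hat{\sigma}_{H_0}^2/\hat{\sigma}^2)$, where $\hat{\sigma}^2$ and $\hat{\sigma}_{H_0}^2$ are the average squared deviations about $\bar{x}$ and about $\mu_0$. The sketch below uses this identity; the function name `lrt_statistic` and the data are made up for the example, and with only eight observations the chi-squared approximation should be regarded as rough.

```python
from math import log
from statistics import NormalDist, fmean


def lrt_statistic(xs, mu0):
    """2 ln of the likelihood ratio (8.2.11) for H0: mu = mu0, sigma^2 free."""
    n = len(xs)
    xbar = fmean(xs)
    var_full = sum((x - xbar) ** 2 for x in xs) / n   # MLE of sigma^2 overall
    var_null = sum((x - mu0) ** 2 for x in xs) / n    # MLE of sigma^2 under H0
    return n * log(var_null / var_full)


# chi^2(1) is the distribution of Z^2 for Z ~ N(0, 1), so its 0.95 quantile
# is the square of the standard normal 0.975 quantile.
CHI2_1_Q95 = NormalDist().inv_cdf(0.975) ** 2

xs = [0.8, 1.9, -0.3, 1.2, 2.4, 0.6, 1.5, 1.1]  # made-up data
stat = lrt_statistic(xs, mu0=0.0)
print(round(stat, 3), round(CHI2_1_Q95, 3), stat > CHI2_1_Q95)
```

The statistic is 0 when $\mu_0 = \bar{x}$ and grows as $\mu_0$ moves away from the sample mean.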




Summary of Section 8.2 


• In searching for an optimal hypothesis testing procedure, we place an upper
bound on the probability of making a type I error (rejecting $H_0$ when it is true)
and search for a test that minimizes the probability of making a type II error
(accepting $H_0$ when it is false).

• The Neyman-Pearson theorem prescribes an optimal size $\alpha$ test when $H_0$ and
$H_a$ each specify a single value for the full parameter $\theta$.

• Sometimes the Neyman-Pearson theorem leads to solutions to hypothesis
testing problems when the null or alternative hypotheses allow for more than one
possible value for $\theta$, but in general we must resort to likelihood ratio tests for
such problems.


EXERCISES 


8.2.1 Suppose that a statistical model is given by the two distributions in the following
table.


f_a(s) | 1/3   1/6   1/12  5/12
f_b(s) | 1/2   1/4   1/6   1/12


Determine the UMP size 0.10 test for testing $H_0 : \theta = a$ versus $H_a : \theta = b$. What is
the power of this test? Repeat this with the size equal to 0.05.

8.2.2 Suppose for the hypothesis testing problem of Exercise 8.2.1, a statistician
decides to generate $U \sim$ Uniform$[0, 1]$ and reject $H_0$ whenever $U \le 0.05$. Show that
this test has size 0.05. Explain why this is not a good choice of test and why the test
derived in Exercise 8.2.1 is better. Provide numerical evidence for this.

8.2.3 Suppose an investigator knows that an industrial process yields a response vari- 
able that follows an N (1, 2) distribution. Some changes have been made in the indus- 
trial process, and the investigator believes that these have possibly made a change in 
the mean of the response (not the variance), increasing its value. The investigator wants 
the probability of a type I error occurring to be less than 1%. Determine an appropriate 
testing procedure for this problem based on a sample of size 10. 

8.2.4 Suppose you have a sample of 20 from an $N(\mu, 1)$ distribution. You form a
0.975-confidence interval for $\mu$ and use it to test $H_0 : \mu = 0$ by rejecting $H_0$ whenever
0 is not in the confidence interval.

(a) What is the size of this test? 

(b) Determine the power function of this test. 

8.2.5 Suppose you have a sample of size $n = 1$ from a Uniform$[0, \theta]$ distribution,
where $\theta > 0$ is unknown. You test $H_0 : \theta \le 1$ by rejecting $H_0$ whenever the sampled
value is greater than 1.

(a) What is the size of this test? 

(b) Determine the power function of this test. 


8.2.6 Suppose you are testing a null hypothesis $H_0 : \theta = 0$, where $\theta \in R^1$. You use a
size 0.05 testing procedure and accept $H_0$. You feel you have a fairly large sample, but
when you compute the power at $\pm 0.2$, you obtain a value of 0.10 where 0.2 represents
the smallest difference from 0 that is of practical importance. Do you believe it makes
sense to conclude that the null hypothesis is true? Justify your conclusion.

8.2.7 Suppose you want to test the null hypothesis $H_0 : \mu = 0$ based on a sample of
$n$ from an $N(\mu, 1)$ distribution, where $\mu \in \{0, 2\}$. How large does $n$ have to be so that
the power at $\mu = 2$, of the optimal size 0.05 test, is equal to 0.99?

8.2.8 Suppose we have available two different test procedures in a problem and these 
have the same power function. Explain why, from the point of view of optimal hypoth- 
esis testing theory, we should not care which test is used. 

8.2.9 Suppose you have a UMP size $\alpha$ test $\varphi$ for testing the hypothesis $H_0 : \psi(\theta) =
\psi_0$, where $\psi$ is real-valued. Explain how the graph of the power function of another
size $\alpha$ test that was not UMP would differ from the graph of the power function of $\varphi$.


COMPUTER EXERCISES 


8.2.10 Suppose you have a coin and you want to test the hypothesis that the coin is
fair, i.e., you want to test $H_0 : \theta = 1/2$ where $\theta$ is the probability of getting a head
on a single toss. You decide to reject $H_0$ using the rejection region $R = \{0, 1, 7, 8\}$
based on $n = 10$ tosses. Tabulate the power function for this procedure for $\theta \in
\{0, 1/8, 2/8, \ldots, 7/8, 1\}$.

8.2.11 On the same graph, plot the power functions for the two-sided z-test of $H_0 :
\mu = 0$ for samples of sizes $n = 1, 4, 10, 20$, and 100 based on $\alpha = 0.05$.

(a) What do you observe about these graphs? 

(b) Explain how these graphs demonstrate the unbiasedness of this test. 


PROBLEMS 


8.2.12 Prove that $\varphi_0^*$ in (8.2.7) is UMP size $\alpha$ for $H_0 : \mu \ge \mu_0$ versus $H_a : \mu < \mu_0$.

8.2.13 Prove that the test function $\varphi(s) = \alpha$ for every $s \in S$ is an exact size $\alpha$ test
function. What is the interpretation of this test function?

8.2.14 Using the test function in Problem 8.2.13, show that a UMP size $\alpha$ test is also a
UMP unbiased size $\alpha$ test.

8.2.15 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Gamma$(\alpha_0, \beta)$ distribution, where
$\alpha_0$ is known and $\beta > 0$ is unknown. Determine the UMP size $\alpha$ test for testing $H_0 :
\beta = \beta_0$ versus $H_a : \beta = \beta_1$, where $\beta_1 > \beta_0$. Is this test UMP size $\alpha$ for $H_0 : \beta \le \beta_0$
versus $H_a : \beta > \beta_0$?

8.2.16 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu_0, \sigma^2)$ distribution, where
$\mu_0$ is known and $\sigma^2 > 0$ is unknown. Determine the UMP size $\alpha$ test for testing
$H_0 : \sigma^2 = \sigma_0^2$ versus $H_a : \sigma^2 = \sigma_1^2$ where $\sigma_0^2 < \sigma_1^2$. Is this test UMP size $\alpha$ for
$H_0 : \sigma^2 \le \sigma_0^2$ versus $H_a : \sigma^2 > \sigma_0^2$?

8.2.17 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Uniform$[0, \theta]$ distribution, where
$\theta > 0$ is unknown. Determine the UMP size $\alpha$ test for testing $H_0 : \theta = \theta_0$ versus
$H_a : \theta = \theta_1$, where $\theta_0 < \theta_1$. Is this test function UMP size $\alpha$ for $H_0 : \theta \le \theta_0$ versus
$H_a : \theta > \theta_0$?




8.2.18 Suppose that $F$ is the distribution function for the Binomial$(n, \theta)$ distribution.
Then prove that
\[ F(x) = \frac{\Gamma(n + 1)}{\Gamma(x + 1)\Gamma(n - x)}\int_\theta^1 y^x(1 - y)^{n - x - 1}\, dy \]
for $x = 0, 1, \ldots, n - 1$. This establishes a relationship between the binomial probability
distribution and the beta function. (Hint: Integration by parts.)

8.2.19 Suppose that $F$ is the distribution function for the Poisson$(\lambda)$ distribution. Then
prove that
\[ F(x) = \frac{1}{x!}\int_\lambda^\infty y^x e^{-y}\, dy \]
for $x = 0, 1, \ldots$. This establishes a relationship between the Poisson probability
distribution and the gamma function. (Hint: Integration by parts.)

8.2.20 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Poisson$(\lambda)$ distribution, where
$\lambda > 0$ is unknown. Determine the UMP size $\alpha$ test for $H_0 : \lambda = \lambda_0$ versus $H_a : \lambda = \lambda_1$,
where $\lambda_0 < \lambda_1$. Is this test function UMP size $\alpha$ for $H_0 : \lambda \le \lambda_0$ versus $H_a : \lambda > \lambda_0$?
(Hint: You will need the result of Problem 8.2.19.)

8.2.21 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where
$(\mu, \sigma^2) \in R^1 \times (0, \infty)$ is unknown. Derive the form of the exact size $\alpha$ likelihood
ratio test for testing $H_0 : \mu = \mu_0$ versus $H_a : \mu \neq \mu_0$.
8.2.22 (Optimal confidence intervals) Suppose that for model $\{f_\theta : \theta \in \Omega\}$ we have a
UMP size $\alpha$ test function $\varphi_{\psi_0}$ for $H_0 : \psi(\theta) = \psi_0$, for each possible value of $\psi_0$.
Suppose further that each $\varphi_{\psi_0}$ only takes values in $\{0, 1\}$, i.e., each $\varphi_{\psi_0}$ is a nonrandomized
size $\alpha$ test function.
(a) Prove that
\[ C(s) = \{\psi_0 : \varphi_{\psi_0}(s) = 0\} \]
satisfies
\[ P_\theta(\psi(\theta) \in C(s)) \ge 1 - \alpha \]
for every $\theta \in \Omega$. Conclude that $C(s)$ is a $(1 - \alpha)$-confidence set for $\psi(\theta)$.
(b) If $C^*$ is a $(1 - \alpha)$-confidence set for $\psi(\theta)$, then prove that the test function defined
by
\[
\varphi^*_{\psi_0}(s) = \begin{cases}
1 & \psi_0 \notin C^*(s) \\
0 & \psi_0 \in C^*(s)
\end{cases}
\]
is size $\alpha$ for $H_0 : \psi(\theta) = \psi_0$.
(c) Suppose that for each value $\psi_0$, the test function $\varphi_{\psi_0}$ is UMP size $\alpha$ for testing
$H_0 : \psi(\theta) = \psi_0$ versus $H_a : \psi(\theta) \neq \psi_0$. Then prove that
\[ P_\theta(\psi(\theta^*) \in C(s)) \tag{8.2.12} \]
is minimized, when $\psi(\theta) \neq \psi_0$, among all $(1 - \alpha)$-confidence sets for $\psi(\theta)$. The
probability (8.2.12) is the probability of $C$ containing the false value $\psi(\theta^*)$, and a
$(1 - \alpha)$-confidence region that minimizes this probability when $\psi(\theta) \neq \psi_0$ is called a
uniformly most accurate (UMA) $(1 - \alpha)$-confidence region for $\psi(\theta)$.


CHALLENGES 


8.2.23 Prove Corollary 8.2.1 in the discrete case. 


8.3 | Optimal Bayesian Inferences 


We now add the prior probability measure $\Pi$ with density $\pi$. As we will see, this
completes the specification of an optimality problem, as now there is always a solution.
Solutions to Bayesian optimization problems are known as Bayes rules.

In Section 8.1, the unrestricted optimization problem was to find the estimator $T$
of $\psi(\theta)$ that minimizes $\mathrm{MSE}_\theta(T) = E_\theta((T - \psi(\theta))^2)$, for each $\theta \in \Omega$. The Bayesian
version of this problem is to minimize
\[ E_\Pi(\mathrm{MSE}_\theta(T)) = E_\Pi(E_\theta((T - \psi(\theta))^2)). \tag{8.3.1} \]
By the theorem of total expectation (Theorem 3.5.2), (8.3.1) is the expected value of
the squared error $(T(s) - \psi(\theta))^2$ under the joint distribution on $(\theta, s)$ induced by the
conditional distribution for $s$, given $\theta$ (the sampling model), and by the marginal
distribution for $\theta$ (the prior distribution of $\theta$). Again, by the theorem of total expectation,
we can write this as
\[ E_\Pi(\mathrm{MSE}_\theta(T)) = E_M(E_{\Pi(\cdot \mid s)}((T - \psi(\theta))^2)), \tag{8.3.2} \]
where $\Pi(\cdot \mid s)$ denotes the posterior probability measure for $\theta$, given the data $s$ (the
conditional distribution of $\theta$ given $s$), and $M$ denotes the prior predictive probability
measure for $s$ (the marginal distribution of $s$).

We have the following result. 


Theorem 8.3.1 When (8.3.1) is finite, a Bayes rule is given by
\[ T(s) = E_{\Pi(\cdot \mid s)}(\psi(\theta)), \]
namely, the posterior expectation of $\psi(\theta)$.


PROOF | First, consider the expected posterior squared error
\[ E_{\Pi(\cdot \mid s)}((T'(s) - \psi(\theta))^2) \]
of an estimate $T'(s)$. By Theorem 8.1.1 this is minimized by taking $T'(s)$ equal to
$T(s) = E_{\Pi(\cdot \mid s)}(\psi(\theta))$ (note that the "random" quantity here is $\theta$).
Now suppose that $T'$ is any estimator of $\psi(\theta)$. Then we have just shown that
\[ 0 \le E_{\Pi(\cdot \mid s)}((T(s) - \psi(\theta))^2) \le E_{\Pi(\cdot \mid s)}((T'(s) - \psi(\theta))^2) \]
and thus,
\[
\begin{aligned}
E_\Pi(\mathrm{MSE}_\theta(T)) &= E_M(E_{\Pi(\cdot \mid s)}((T(s) - \psi(\theta))^2)) \\
&\le E_M(E_{\Pi(\cdot \mid s)}((T'(s) - \psi(\theta))^2)) = E_\Pi(\mathrm{MSE}_\theta(T')).
\end{aligned}
\]
Therefore, $T$ minimizes (8.3.1) and is a Bayes rule. ∎


So we see that, under mild conditions, the optimal Bayesian estimation problem 
always has a solution and there is no need to restrict ourselves to unbiased estimators, 
etc. 
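When the parameter space is finite, the posterior expectation can be computed by direct enumeration. The sketch below assumes a hypothetical two-point model ($\Omega = \{0, 1\}$, $S = \{a, b\}$) with a made-up prior and sampling distributions; the function names `posterior` and `bayes_estimate` are likewise illustrative.

```python
from fractions import Fraction as F


def posterior(prior, likelihood, s):
    """Posterior pmf on a finite parameter set, given observed s."""
    joint = {theta: prior[theta] * likelihood[theta][s] for theta in prior}
    m_s = sum(joint.values())  # prior predictive probability m(s)
    return {theta: w / m_s for theta, w in joint.items()}


def bayes_estimate(post, psi=lambda theta: theta):
    """Posterior expectation of psi(theta)."""
    return sum(psi(theta) * p for theta, p in post.items())


# Hypothetical prior and sampling model, used only for illustration.
prior = {0: F(2, 3), 1: F(1, 3)}
likelihood = {0: {"a": F(3, 4), "b": F(1, 4)},
              1: {"a": F(1, 4), "b": F(3, 4)}}

post = posterior(prior, likelihood, "b")
print(post[0], post[1], bayes_estimate(post))  # prints: 2/5 3/5 3/5
```

Note that the estimate 3/5 is not itself a point of $\Omega$; minimizing expected squared error does not force the estimate to lie in the parameter space.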

For the hypothesis testing problem $H_0 : \psi(\theta) = \psi_0$, we want to find the test
function $\varphi$ that minimizes the prior probability of making an error (type I or type II).
Such a $\varphi$ is a Bayes rule. We have the following result.


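Theorem 8.3.2 A Bayes rule for the hypothesis testing problem $H_0 : \psi(\theta) = \psi_0$
is given by
\[
\varphi_0(s) = \begin{cases}
1 & \Pi(\{\psi(\theta) = \psi_0\} \mid s) \le \Pi(\{\psi(\theta) \neq \psi_0\} \mid s) \\
0 & \text{otherwise.}
\end{cases}
\]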


PROOF | Consider test function $\varphi$ and let $I_{\{\psi(\theta) = \psi_0\}}(\theta)$ denote the indicator function
of the set $\{\theta : \psi(\theta) = \psi_0\}$ (so $I_{\{\psi(\theta) = \psi_0\}}(\theta) = 1$ when $\psi(\theta) = \psi_0$ and equals 0
otherwise). Observe that $\varphi(s)$ is the probability of rejecting $H_0$, having observed $s$,
which is an error when $I_{\{\psi(\theta) = \psi_0\}}(\theta) = 1$; $1 - \varphi(s)$ is the probability of accepting $H_0$,
having observed $s$, which is an error when $I_{\{\psi(\theta) = \psi_0\}}(\theta) = 0$. Therefore, given $s$ and
$\theta$, the probability of making an error is
\[ e(\theta, s) = \varphi(s) I_{\{\psi(\theta) = \psi_0\}}(\theta) + (1 - \varphi(s))(1 - I_{\{\psi(\theta) = \psi_0\}}(\theta)). \]
By the theorem of total expectation, the prior probability of making an error (taking
the expectation of $e(\theta, s)$ under the joint distribution of $(\theta, s)$) is
\[ E_M(E_{\Pi(\cdot \mid s)}(e(\theta, s))). \tag{8.3.3} \]
As in the proof of Theorem 8.3.1, if we can find $\varphi$ that minimizes $E_{\Pi(\cdot \mid s)}(e(\theta, s))$ for
each $s$, then $\varphi$ also minimizes (8.3.3) and is a Bayes rule.

Using Theorem 3.5.4 to pull $\varphi(s)$ through the conditional expectation, and the fact
that $E_{\Pi(\cdot \mid s)}(I_A(\theta)) = \Pi(A \mid s)$ for any event $A$, then
\[ E_{\Pi(\cdot \mid s)}(e(\theta, s)) = \varphi(s)\Pi(\{\psi(\theta) = \psi_0\} \mid s) + (1 - \varphi(s))(1 - \Pi(\{\psi(\theta) = \psi_0\} \mid s)). \]
Because $\varphi(s) \in [0, 1]$, we have
\[
\begin{aligned}
&\min\{\Pi(\{\psi(\theta) = \psi_0\} \mid s),\ 1 - \Pi(\{\psi(\theta) = \psi_0\} \mid s)\} \\
&\qquad \le \varphi(s)\Pi(\{\psi(\theta) = \psi_0\} \mid s) + (1 - \varphi(s))(1 - \Pi(\{\psi(\theta) = \psi_0\} \mid s)).
\end{aligned}
\]
Therefore, the minimum value of $E_{\Pi(\cdot \mid s)}(e(\theta, s))$ is attained by $\varphi(s) = \varphi_0(s)$. ∎


Observe that Theorem 8.3.2 says that the Bayes rule rejects $H_0$ whenever the posterior
probability of the null hypothesis is less than or equal to the posterior probability
of the alternative. This is an intuitively satisfying result.

The following problem does arise with this approach, however. We have

$$\Pi(\{\psi(\theta) = \psi_0\} \mid s) = \frac{E_\Pi\big(I_{\{\psi(\theta)=\psi_0\}}(\theta)\, f_\theta(s)\big)}{m(s)}. \qquad (8.3.4)$$

When $\Pi(\{\psi(\theta) = \psi_0\}) = 0$, (8.3.4) implies that $\Pi(\{\psi(\theta) = \psi_0\} \mid s) = 0$ for every $s$.
Therefore, using the Bayes rule, we would always reject $H_0$ no matter what data $s$ are
obtained, which does not seem sensible. As discussed in Section 7.2.3, we have to be
careful to make sure we use a prior $\Pi$ that assigns positive mass to $H_0$ if we are going
to use the optimal Bayes approach to a hypothesis testing problem.
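
The Bayes test of Theorem 8.3.2, and the degenerate behavior when the prior puts zero mass on the null hypothesis, can be sketched for a finite parameter space as follows. The three-point model, both priors, and the function name `bayes_test` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction as F

def bayes_test(prior, f, null, s):
    """Return (posterior probability of H0, phi_0(s)) for a finite parameter space.

    prior: dict theta -> prior mass
    f:     dict theta -> dict s -> P_theta({s})
    null:  set of theta values for which H0 is true
    """
    joint = {th: prior[th] * f[th][s] for th in prior}
    m = sum(joint.values())                          # prior predictive m(s)
    post_null = sum(p for th, p in joint.items() if th in null) / m
    phi0 = 1 if post_null <= 1 - post_null else 0    # reject when H0 is not more probable
    return post_null, phi0

# Hypothetical model with Omega = {1, 2, 3}, S = {0, 1}, and H0: theta = 1.
f = {1: {0: F(3, 4), 1: F(1, 4)},
     2: {0: F(1, 2), 1: F(1, 2)},
     3: {0: F(1, 4), 1: F(3, 4)}}
null = {1}

positive_mass = {1: F(1, 2), 2: F(1, 4), 3: F(1, 4)}
zero_mass = {1: F(0), 2: F(1, 2), 3: F(1, 2)}

for s in (0, 1):
    print(s, bayes_test(positive_mass, f, null, s), bayes_test(zero_mass, f, null, s))
```

With the second prior the posterior probability of the null is 0 for both data values, so the rule rejects regardless of s, which is the behavior the text warns about.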


Summary of Section 8.3 


• Optimal Bayesian procedures are obtained by minimizing the expected performance
measure using the posterior distribution.


• In estimation problems, when using squared error as the performance measure,
the posterior mean is optimal.


• In hypothesis testing problems, when minimizing the probability of making an
error as the performance measure, then computing the posterior probability of
the null hypothesis and accepting $H_0$ when this is greater than 1/2 is optimal.


EXERCISES 


8.3.1 Suppose that $S = \{1, 2, 3\}$, $\Omega = \{1, 2\}$, with data distributions given by the
following table. We place a uniform prior on $\theta$ and want to estimate $\theta$.


         | s = 1   s = 2   s = 3
$f_1(s)$ |  1/6     1/6     2/3
$f_2(s)$ |  1/4     1/4     1/2


Using a Bayes rule, test the hypothesis $H_0 : \theta = 2$ when $s = 2$ is observed.

8.3.2 For the situation described in Exercise 8.3.1, determine the Bayes rule estimator
of $\theta$ when using expected squared error as our performance measure for estimators.

8.3.3 Suppose that we have a sample $(x_1, \ldots, x_n)$ from an $N(\mu, \sigma_0^2)$ distribution,
where $\mu$ is unknown and $\sigma_0^2$ is known, and we want to estimate $\mu$ using expected
squared error as our performance measure for estimators. If we use the prior distribution
$\mu \sim N(\mu_0, \tau_0^2)$, then determine the Bayes rule for this problem. Determine the
limiting Bayes rule as $\tau_0 \to \infty$.




8.3.4 Suppose that we observe a sample $(x_1, \ldots, x_n)$ from a Bernoulli($\theta$) distribution,
where $\theta$ is completely unknown, and we want to estimate $\theta$ using expected squared
error as our performance measure for estimators. If we use the prior distribution
$\theta \sim$ Beta($\alpha, \beta$), then determine a Bayes rule for this problem.

8.3.5 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Gamma($\alpha_0, \beta$) distribution, where
$\alpha_0$ is known, and $\beta \sim$ Gamma($\tau_0, \upsilon_0$), where $\tau_0$ and $\upsilon_0$ are known. If we want to
estimate $\beta$ using expected squared error as our performance measure for estimators,
then determine the Bayes rule. Use the weak (or strong) law of large numbers to
determine what this estimator converges to as $n \to \infty$.

8.3.6 For the situation described in Exercise 8.3.5, determine the Bayes rule for estimating
$\beta^{-1}$ when using expected squared error as our performance measure for estimators.

8.3.7 Suppose that we have a sample $(x_1, \ldots, x_n)$ from an $N(\mu, \sigma_0^2)$ distribution,
where $\mu$ is unknown and $\sigma_0^2$ is known, and we want to find the test of $H_0 : \mu = \mu_0$
that minimizes the prior probability of making an error (type I or type II). If we use the
prior distribution $\mu \sim p_0 I_{\{\mu_0\}} + (1 - p_0) N(\mu_0, \tau_0^2)$, where $p_0 \in (0, 1)$ is known (i.e.,
the prior is a mixture of a distribution degenerate at $\mu_0$ and an $N(\mu_0, \tau_0^2)$ distribution),
then determine the Bayes rule for this problem. Determine the limiting Bayes rule as
$\tau_0 \to \infty$. (Hint: Make use of the computations in Example 7.2.13.)

8.3.8 Suppose that we have a sample $(x_1, \ldots, x_n)$ from a Bernoulli($\theta$) distribution,
where $\theta$ is unknown, and we want to find the test of $H_0 : \theta = \theta_0$ that minimizes the
prior probability of making an error (type I or type II). If we use the prior distribution
$\theta \sim p_0 I_{\{\theta_0\}} + (1 - p_0)$Uniform[0, 1], where $p_0 \in (0, 1)$ is known (i.e., the prior is a
mixture of a distribution degenerate at $\theta_0$ and a uniform distribution), then determine
the Bayes rule for this problem.


PROBLEMS 


8.3.9 Suppose that $\Omega = \{\theta_1, \theta_2\}$, that we put a prior $\pi$ on $\Omega$, and that we want to estimate
$\theta$. Suppose our performance measure for estimators is the probability of making
an incorrect choice of $\theta$. If the model is denoted $\{f_\theta : \theta \in \Omega\}$, then obtain the form of
the Bayes rule when data $s$ are observed.

8.3.10 For the situation described in Exercise 8.3.1, use the Bayes rule obtained via the
method of Problem 8.3.9 to estimate $\theta$ when $s = 2$. What advantage does this estimate
have over that obtained in Exercise 8.3.2?

8.3.11 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution where
$(\mu, \sigma^2) \in R^1 \times (0, \infty)$ is unknown, and we want to estimate $\mu$ using expected squared
error as our performance measure for estimators. Using the prior distribution given by

$$\mu \mid \sigma^2 \sim N(\mu_0, \tau_0^2 \sigma^2),$$

and using

$$\frac{1}{\sigma^2} \sim \text{Gamma}(\alpha_0, \beta_0),$$

where $\mu_0$, $\tau_0^2$, $\alpha_0$, and $\beta_0$ are fixed and known, determine the Bayes rule for $\mu$.




8.3.12 (Model selection) Generalize Problem 8.3.9 to the case $\Omega = \{\theta_1, \ldots, \theta_k\}$.


CHALLENGES 


8.3.13 In Section 7.2.4, we described the Bayesian prediction problem. Using the
notation found there, suppose we wish to predict $t \in R^1$ using a predictor $T(s)$. If we
assess the accuracy of a predictor by

$$E((T(s) - t)^2) = E_\Pi\Big(E_{P_\theta}\big(E_{Q_\theta(\cdot \mid s)}((T(s) - t)^2)\big)\Big),$$

then determine the prior predictor that minimizes this quantity (assume all relevant
expectations are finite). If we observe $s_0$, then determine the best predictor. (Hint:
Assume all the probability measures are discrete.)


8.4 Decision Theory (Advanced)


To determine an optimal inference, we chose a performance measure and then attempted
to find an inference, of a given type, that has optimal performance with respect
to this measure. For example, when considering estimates of a real-valued characteristic
of interest $\psi(\theta)$, we took the performance measure to be MSE and then searched
for the estimator that minimizes this for each value of $\theta$.

Decision theory is closely related to the optimal approach to deriving inferences,
but it is a little more specialized. In the decision framework, we take the point of view
that, in any statistical problem, the statistician is faced with making a decision, e.g.,
deciding on a particular value for $\psi(\theta)$. Furthermore, associated with a decision is
the notion of a loss incurred whenever the decision is incorrect. A decision rule is a
procedure, based on the observed data $s$, that the statistician uses to select a decision.
The decision problem is then to find a decision rule that minimizes the average loss
incurred.

There are a number of real-world contexts in which losses are an obvious part of 
the problem, e.g., the monetary losses associated with various insurance plans that an 
insurance company may consider offering. So the decision theory approach has many 
applications. It is clear in many practical problems, however, that losses (as well as 
performance measures) are somewhat arbitrary components of a statistical problem, 
often chosen simply for convenience. In such circumstances, the approaches to deriv- 
ing inferences described in Chapters 6 and 7 are preferred by many statisticians. 

So the decision theory model for inference adds another ingredient to the sampling
model (or to the sampling model and prior) to derive inferences: the loss function. To
formalize this, we conceive of a set of possible actions or decisions that the statistician
could take after observing the data $s$. This set of possible actions is denoted by $\mathcal{A}$ and
is called the action space. To connect these actions with the statistical model, there
is a correct action function $A : \Omega \to \mathcal{A}$ such that $A(\theta)$ is the correct action to take
when $\theta$ is the true value of the parameter. Of course, because we do not know $\theta$, we
do not know the correct action $A(\theta)$, so there is uncertainty involved in our decision.
Consider a simple example.




EXAMPLE 8.4.1 

Suppose you are told that an urn containing 100 balls has either 50 white and 50 black 
balls or 60 white and 40 black balls. Five balls are drawn from the urn without replace- 
ment and their colors are observed. The statistician’s job is to make a decision about 
the true proportion of white balls in the urn based on these data. 

The statistical model then comprises two distributions $\{P_1, P_2\}$ where, using parameter
space $\Omega = \{1, 2\}$, $P_1$ is the Hypergeometric(100, 50, 5) distribution (see Example
2.3.7) and $P_2$ is the Hypergeometric(100, 60, 5) distribution. The action space is
$\mathcal{A} = \{0.5, 0.6\}$, and $A : \Omega \to \mathcal{A}$ is given by $A(1) = 0.5$ and $A(2) = 0.6$. The data are
given by the colors of the five balls drawn. ■
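
The two distributions in this model can be tabulated directly. The sketch below summarizes the data by the number of white balls among the five drawn, which is an assumption made here for convenience; the helper `hypergeom_pmf` and the dictionaries `model` and `correct_action` are illustrative names, not notation from the text.

```python
from math import comb

def hypergeom_pmf(N, M, n, x):
    """P(x white balls in n draws without replacement from N balls, M of them white)."""
    if x < max(0, n - (N - M)) or x > min(n, M):
        return 0.0
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# theta = 1: 50 white balls out of 100; theta = 2: 60 white balls out of 100.
model = {1: [hypergeom_pmf(100, 50, 5, x) for x in range(6)],
         2: [hypergeom_pmf(100, 60, 5, x) for x in range(6)]}

# Correct action function A(theta): the true proportion of white balls.
correct_action = {1: 0.5, 2: 0.6}

for x in range(6):
    print(x, round(model[1][x], 4), round(model[2][x], 4))
```

Every count from 0 to 5 has positive probability under both values of the parameter, so no outcome identifies the urn with certainty and a decision rule is genuinely needed.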


We suppose now that there is also a loss or penalty $L(\theta, a)$ incurred when we select
action $a \in \mathcal{A}$ and $\theta$ is true. If we select the correct action, then the loss is 0; it is greater
than 0 otherwise.


Definition 8.4.1 A loss function is a function $L$ defined on $\Omega \times \mathcal{A}$ and taking values
in $[0, \infty)$ such that $L(\theta, a) = 0$ if and only if $a = A(\theta)$.


Sometimes the loss can be an actual monetary loss. Actually, decision theory is a 
little more general than what we have just described, as we can allow for negative 
losses (gains or profits), but the restriction to nonnegative losses is suitable for purely 
statistical applications. 

In a specific problem, the statistician chooses a loss function that is believed to 
lead to reasonable statistical procedures. This choice is dependent on the particular 
application. Consider some examples. 


EXAMPLE 8.4.2 (Example 8.4.1 continued)
Perhaps a sensible choice in this problem would be

$$L(\theta, a) = \begin{cases} 1 & \theta = 1,\ a = 0.6 \\ 2 & \theta = 2,\ a = 0.5 \\ 0 & \text{otherwise.} \end{cases}$$

Here we have decided that selecting $a = 0.5$ when it is not correct is a more serious
error than selecting $a = 0.6$ when it is not correct. If we want to treat errors symmetrically,
then we could take

$$L(\theta, a) = I_{\{(1,\, 0.6),\, (2,\, 0.5)\}}(\theta, a),$$

i.e., the losses are 1 or 0. ■


EXAMPLE 8.4.3 Estimation as a Decision Problem
Suppose we have a marginal parameter $\psi(\theta)$ of interest, and we want to specify an
estimate $T(s)$ after observing $s \in S$. Here, the action space is $\mathcal{A} = \{\psi(\theta) : \theta \in \Omega\}$
and $A(\theta) = \psi(\theta)$. Naturally, we want $T(s) \in \mathcal{A}$.

For example, suppose $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution,
where $(\mu, \sigma^2) \in \Omega = R^1 \times R^+$ is unknown, and we want to estimate $\psi(\mu, \sigma^2) = \mu$. In
this case, $\mathcal{A} = R^1$ and a possible estimator is the sample average $T(x_1, \ldots, x_n) = \bar{x}$.




There are many possible choices for the loss function. Perhaps a natural choice is
to use

$$L(\theta, a) = |\psi(\theta) - a|, \qquad (8.4.1)$$

the absolute deviation between $\psi(\theta)$ and $a$. Alternatively, it is common to use

$$L(\theta, a) = (\psi(\theta) - a)^2, \qquad (8.4.2)$$

the squared deviation between $\psi(\theta)$ and $a$.

We refer to (8.4.2) as squared error loss. Notice that (8.4.2) is just the square of
the Euclidean distance between $\psi(\theta)$ and $a$. It might seem more natural to actually use
the distance (8.4.1) as the loss function. It turns out, however, that there are a number
of mathematical conveniences that arise from using squared distance. ■


EXAMPLE 8.4.4 Hypothesis Testing as a Decision Problem
In this problem, we have a characteristic of interest $\psi(\theta)$ and want to assess the
plausibility of the value $\psi_0$ after viewing the data $s$. In a hypothesis testing problem, this is
written as $H_0 : \psi(\theta) = \psi_0$ versus $H_a : \psi(\theta) \ne \psi_0$. As in Section 8.2, we refer to $H_0$
as the null hypothesis and to $H_a$ as the alternative hypothesis.

The purpose of a hypothesis testing procedure is to decide which of $H_0$ or $H_a$ is
true based on the observed data $s$. So in this problem, the action space is $\mathcal{A} = \{H_0, H_a\}$
and the correct action function is

$$A(\theta) = \begin{cases} H_0 & \psi(\theta) = \psi_0 \\ H_a & \psi(\theta) \ne \psi_0. \end{cases}$$

An alternative, and useful, way of thinking of the two hypotheses is as subsets of
$\Omega$. We write $H_0 = \psi^{-1}\{\psi_0\}$ as the subset of all $\theta$ values that make the null hypothesis
true, and $H_a = H_0^c$ is the subset of all $\theta$ values that make the null hypothesis false.
Then, based on the data $s$, we want to decide if the true value of $\theta$ is in $H_0$ or if $\theta$ is in
$H_a$. If $H_0$ (or $H_a$) is composed of a single point, then it is called a simple hypothesis or
a point hypothesis; otherwise, it is referred to as a composite hypothesis.

For example, suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution
where $\theta = (\mu, \sigma^2) \in \Omega = R^1 \times R^+$, $\psi(\theta) = \mu$, and we want to test the null hypothesis
$H_0 : \mu = \mu_0$ versus the alternative $H_a : \mu \ne \mu_0$. Then $H_0 = \{\mu_0\} \times R^+$ and
$H_a = \{\mu_0\}^c \times R^+$. For the same model, let

$$\psi(\theta) = I_{(-\infty,\, \mu_0] \times R^+}(\mu, \sigma^2),$$

i.e., $\psi$ is the indicator function for the subset $(-\infty, \mu_0] \times R^+$. Then testing $H_0 : \psi = 1$
versus the alternative $H_a : \psi = 0$ is equivalent to testing that the mean is less than or
equal to $\mu_0$ versus the alternative that it is greater than $\mu_0$. This one-sided hypothesis
testing problem is often denoted as $H_0 : \mu \le \mu_0$ versus $H_a : \mu > \mu_0$.

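There are a number of possible choices for the loss function, but the most commonly
used is of the form

$$L(\theta, a) = \begin{cases} 0 & \theta \in H_0,\ a = H_0 \ \text{or} \ \theta \in H_a,\ a = H_a \\ b & \theta \notin H_0,\ a = H_0 \\ c & \theta \notin H_a,\ a = H_a. \end{cases}$$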




If we reject $H_0$ when $H_0$ is true (a type I error), we incur a loss of $c$; if we accept $H_0$
when $H_0$ is false (a type II error), we incur a loss of $b$. When $b = c$, we can take
$b = c = 1$ and produce the commonly used 0-1 loss function. ■


A statistician faced with a decision problem — i.e., a model, action space, correct 
action function, and loss function — must now select a rule for choosing an element of 
the action space A when the data s are observed. A decision function is a procedure 
that specifies how an action is to be selected in the action space A. 


Definition 8.4.2 A nonrandomized decision function $d$ is a function $d : S \to \mathcal{A}$.


So after observing s, we decide that the appropriate action is d(s). 
Actually, we will allow our decision procedures to be a little more general than this, 
as we permit a random choice of an action after observing s. 


Definition 8.4.3 A decision function $\delta$ is such that $\delta(s, \cdot)$ is a probability measure
on the action space $\mathcal{A}$ for each $s \in S$ (so $\delta(s, A)$ is the probability that the action
taken is in $A \subset \mathcal{A}$).


Operationally, after observing $s$, a random mechanism with distribution specified by
$\delta(s, \cdot)$ is used to select the action from the set of possible actions. Notice that if $\delta(s, \cdot)$
is a probability measure degenerate at the point $d(s)$ (so $\delta(s, \{d(s)\}) = 1$) for each
$s$, then $\delta$ is equivalent to the nonrandomized decision function $d$ and conversely (see
Problem 8.4.8).

The use of randomized decision procedures may seem rather unnatural, but, as
we will see, sometimes they are an essential ingredient of decision theory. In many
estimation problems, the use of randomized procedures provides no advantage, but this
is not the case in hypothesis testing problems. We let $D$ denote the set of all decision
functions $\delta$ for the specific problem of interest.

The decision problem is to choose a decision function $\delta \in D$. The selected $\delta$ will
then be used to generate decisions in applications. We base this choice on how the
various decision functions $\delta$ perform with respect to the loss function. Intuitively, we
want to choose $\delta$ to make the loss as small as possible. For a particular $\delta$, because
$s \sim f_\theta$ and $a \sim \delta(s, \cdot)$, the loss $L(\theta, a)$ is a random quantity. Therefore, rather
than minimizing specific losses, we speak instead about minimizing some aspect of the
distribution of the losses for each $\theta \in \Omega$. Perhaps a reasonable choice is to minimize
the average loss. Accordingly, we define the risk function associated with $\delta \in D$ as
the average loss incurred by $\delta$. The risk function plays a central role in determining an
appropriate decision function for a problem.


Definition 8.4.4 The risk function associated with decision function $\delta$ is given by

$$R_\delta(\theta) = E_\theta\big(E_{\delta(s,\cdot)}(L(\theta, a))\big). \qquad (8.4.3)$$


Notice that to calculate the risk function we first calculate the average of $L(\theta, a)$,
based on $s$ fixed and $a \sim \delta(s, \cdot)$. Then we average this conditional average with respect
to $s \sim f_\theta$. By the theorem of total expectation, this is the average loss. When $\delta(s, \cdot)$ is
degenerate at $d(s)$ for each $s$, then (8.4.3) simplifies (see Problem 8.4.8) to

$$R_\delta(\theta) = E_\theta(L(\theta, d(s))).$$


Consider the following examples. 


EXAMPLE 8.4.5 
Suppose that $S = \{1, 2, 3\}$, $\Omega = \{1, 2\}$, and the distributions are given by the following
table.


         | s = 1   s = 2   s = 3
$f_1(s)$ |  1/3     1/3     1/3
$f_2(s)$ |  1/2     1/2      0

Further suppose that $\mathcal{A} = \Omega$, $A(\theta) = \theta$, and the loss function is given by $L(\theta, a) = 1$
when $\theta \ne a$ but is 0 otherwise.
Now consider the decision function $\delta$ specified by the following table.


                   | a = 1   a = 2
$\delta(1, \{a\})$ |  1/4     3/4
$\delta(2, \{a\})$ |  1/4     3/4
$\delta(3, \{a\})$ |   1       0


So when we observe $s = 1$, we randomly choose the action $a = 1$ with probability 1/4
and choose the action $a = 2$ with probability 3/4, etc. Notice that this decision function
does the sensible thing and selects the decision $a = 1$ when we observe $s = 3$, as we
know unequivocally that $\theta = 1$ in this case.


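We have

$$\begin{aligned}
E_{\delta(1,\cdot)}(L(\theta, a)) &= \tfrac{1}{4} L(\theta, 1) + \tfrac{3}{4} L(\theta, 2), \\
E_{\delta(2,\cdot)}(L(\theta, a)) &= \tfrac{1}{4} L(\theta, 1) + \tfrac{3}{4} L(\theta, 2), \\
E_{\delta(3,\cdot)}(L(\theta, a)) &= L(\theta, 1),
\end{aligned}$$

so the risk function of $\delta$ is then given by

$$\begin{aligned}
R_\delta(1) &= E_1\big(E_{\delta(s,\cdot)}(L(1, a))\big) \\
&= \tfrac{1}{3}\big(\tfrac{1}{4} L(1, 1) + \tfrac{3}{4} L(1, 2)\big) + \tfrac{1}{3}\big(\tfrac{1}{4} L(1, 1) + \tfrac{3}{4} L(1, 2)\big) + \tfrac{1}{3} L(1, 1) \\
&= \tfrac{1}{4} + \tfrac{1}{4} + 0 = \tfrac{1}{2}
\end{aligned}$$

and

$$\begin{aligned}
R_\delta(2) &= E_2\big(E_{\delta(s,\cdot)}(L(2, a))\big) \\
&= \tfrac{1}{2}\big(\tfrac{1}{4} L(2, 1) + \tfrac{3}{4} L(2, 2)\big) + \tfrac{1}{2}\big(\tfrac{1}{4} L(2, 1) + \tfrac{3}{4} L(2, 2)\big) + 0 \cdot L(2, 1) \\
&= \tfrac{1}{8} + \tfrac{1}{8} + 0 = \tfrac{1}{4}.
\end{aligned}$$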
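
The risk calculation in this example can be checked mechanically from Definition 8.4.4. In the sketch below the values of δ(2, ·) are taken to equal those of δ(1, ·), which is an inference from the terms 1/8 + 1/8 + 0 in the computation of the risk at θ = 2; the dictionary layout and function names are illustrative only.

```python
from fractions import Fraction as F

# f[theta][s] = P_theta({s}) for the model of Example 8.4.5.
f = {1: {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)},
     2: {1: F(1, 2), 2: F(1, 2), 3: F(0)}}

# delta[s][a] = probability of choosing action a after observing s.
# The row for s = 2 is inferred from the risk computation (assumption).
delta = {1: {1: F(1, 4), 2: F(3, 4)},
         2: {1: F(1, 4), 2: F(3, 4)},
         3: {1: F(1), 2: F(0)}}

def loss(theta, a):
    """0-1 loss with correct action A(theta) = theta."""
    return 0 if a == theta else 1

def risk(theta):
    """R_delta(theta): average over s ~ f_theta of the average loss over a ~ delta(s, .)."""
    return sum(f[theta][s] * sum(delta[s][a] * loss(theta, a) for a in delta[s])
               for s in f[theta])

print(risk(1), risk(2))
```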




EXAMPLE 8.4.6 Estimation 
We will restrict our attention to nonrandomized decision functions and note that these
are also called estimators. The risk function associated with estimator $T$ and loss function
(8.4.1) is given by

$$R_T(\theta) = E_\theta(|\psi(\theta) - T|)$$

and is called the mean absolute deviation (MAD). The risk function associated with
the estimator $T$ and loss function (8.4.2) is given by

$$R_T(\theta) = E_\theta((\psi(\theta) - T)^2)$$

and is called the MSE.

We want to choose the estimator $T$ to minimize $R_T(\theta)$ for every $\theta \in \Omega$. Note that,
when using (8.4.2), this decision problem is exactly the same as the optimal estimation
problem discussed in Section 8.1. ■
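
To see the two risk functions side by side, the sketch below evaluates the MAD and MSE of the sample mean exactly for a Bernoulli sample by summing over the possible numbers of successes. The choice of model, the sample size, and the names `risk_of_mean`, `abs_loss`, and `sq_loss` are assumptions made for illustration.

```python
from math import comb

def abs_loss(theta, a):
    """Loss (8.4.1) with psi(theta) = theta."""
    return abs(theta - a)

def sq_loss(theta, a):
    """Loss (8.4.2) with psi(theta) = theta."""
    return (theta - a) ** 2

def risk_of_mean(theta, n, loss):
    """Exact risk of T = x-bar for a Bernoulli(theta) sample of size n."""
    total = 0.0
    for k in range(n + 1):                       # k = number of successes
        prob = comb(n, k) * theta**k * (1 - theta)**(n - k)
        total += prob * loss(theta, k / n)
    return total

for theta in (0.1, 0.3, 0.5):
    print(theta, risk_of_mean(theta, 10, abs_loss), risk_of_mean(theta, 10, sq_loss))
```

Because the sample mean is unbiased here, its MSE coincides with its variance, which gives an independent check on the enumeration.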


EXAMPLE 8.4.7 Hypothesis Testing 

We note that for a given decision function $\delta$ for this problem, and a data value $s$,
the distribution $\delta(s, \cdot)$ is characterized by $\varphi(s) = \delta(s, H_a)$, which is the probability
of rejecting $H_0$ when $s$ has been observed. This is because the probability measure
$\delta(s, \cdot)$ is concentrated on two points, so we need only give its value at one of these to
completely specify it. We call $\varphi$ the test function associated with $\delta$ and observe that a
decision function for this problem is also specified by a test function $\varphi$.

We have immediately that

$$E_{\delta(s,\cdot)}(L(\theta, a)) = (1 - \varphi(s))\, L(\theta, H_0) + \varphi(s)\, L(\theta, H_a). \qquad (8.4.4)$$

Therefore, when using the 0-1 loss function,

$$\begin{aligned}
R_\delta(\theta) &= E_\theta\big((1 - \varphi(s))\, L(\theta, H_0) + \varphi(s)\, L(\theta, H_a)\big) \\
&= L(\theta, H_0) + E_\theta(\varphi(s))\,\big(L(\theta, H_a) - L(\theta, H_0)\big) \\
&= \begin{cases} E_\theta(\varphi(s)) & \theta \in H_0 \\ 1 - E_\theta(\varphi(s)) & \theta \in H_a. \end{cases}
\end{aligned}$$


Recall that in Section 6.3.6, we introduced the power function associated with a
hypothesis assessment procedure that rejected $H_0$ whenever the P-value was smaller
than some prescribed value. The power function, evaluated at $\theta$, is the probability that
such a procedure rejects $H_0$ when $\theta$ is the true value. Because $\varphi(s)$ is the conditional
probability, given $s$, that $H_0$ is rejected, the theorem of total expectation implies that
$E_\theta(\varphi(s))$ equals the unconditional probability that we reject $H_0$ when $\theta$ is the true
value. So in general, we refer to the function

$$\beta_\varphi(\theta) = E_\theta(\varphi(s))$$

as the power function of the decision procedure $\delta$ or, equivalently, as the power function
of the test function $\varphi$.

Therefore, minimizing the risk function in this case is equivalent to choosing $\varphi$
to minimize $\beta_\varphi(\theta)$ for every $\theta \in H_0$ and to maximize $\beta_\varphi(\theta)$ for every $\theta \in H_a$.
Accordingly, this decision problem is exactly the same as the optimal inference problem
discussed in Section 8.2. ■
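
The link between the power function and the 0-1 risk can be illustrated on the three-point sample space of Example 8.4.5, treating that problem as a test of the null hypothesis θ = 1. The test function below rejects with probability 3/4 when s is 1 or 2 and never when s is 3; reading the example this way, and the names `power` and `risk_01`, are assumptions of this sketch.

```python
from fractions import Fraction as F

# f[theta][s] = P_theta({s}); H0 is the subset {1} of the parameter space.
f = {1: {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)},
     2: {1: F(1, 2), 2: F(1, 2), 3: F(0)}}
H0 = {1}

# phi[s] = probability of rejecting H0 after observing s (illustrative values).
phi = {1: F(3, 4), 2: F(3, 4), 3: F(0)}

def power(theta):
    """beta_phi(theta) = E_theta(phi(s))."""
    return sum(f[theta][s] * phi[s] for s in f[theta])

def risk_01(theta):
    """Risk under 0-1 loss: power on H0, one minus power on Ha."""
    return power(theta) if theta in H0 else 1 - power(theta)

for theta in f:
    print(theta, power(theta), risk_01(theta))
```

The resulting risks, 1/2 at θ = 1 and 1/4 at θ = 2, agree with a direct evaluation of Definition 8.4.4 for the same randomized rule.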




Once we have written down all the ingredients for a decision problem, it is then
clear what form a solution to the problem will take. In particular, any decision function
$\delta_0$ that satisfies

$$R_{\delta_0}(\theta) \le R_\delta(\theta)$$

for every $\theta \in \Omega$ and $\delta \in D$ is an optimal decision function and is a solution. If
two decision functions have the same risk functions, then, from the point of view of
decision theory, they are equivalent. So it is conceivable that there might be more than
one solution to a decision problem.

Actually, it turns out that an optimal decision function exists only in extremely 
unrealistic cases, namely, the data always tell us categorically what the correct decision 
is (see Problem 8.4.9). We do not really need statistical inference for such situations. 
For example, suppose we have two coins — coin A has two heads and coin B has two 
tails. As soon as we observe an outcome from a coin toss, we know exactly which coin 
was tossed and there is no need for statistical inference. 

Still, we can identify some decision rules that we do not want to use. For example,
if $\delta \in D$ is such that there exists $\delta_0 \in D$ satisfying $R_{\delta_0}(\theta) \le R_\delta(\theta)$ for every $\theta$, and if
there is at least one $\theta$ for which $R_{\delta_0}(\theta) < R_\delta(\theta)$, then naturally we strictly prefer $\delta_0$ to
$\delta$.


Definition 8.4.5 A decision function $\delta$ is said to be admissible if there is no $\delta_0$ that
is strictly preferred to it.


A consequence of decision theory is that we should use only admissible decision 
functions. Still, there are many admissible decision functions and typically none is 
optimal. Furthermore, a procedure that is only admissible may be a very poor choice 
(see Challenge 8.4.11). 
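
The strict-preference ordering can be explored by brute force in a small problem. The sketch below uses the model and 0-1 loss of Example 8.4.5 and compares only the eight nonrandomized rules with one another, so a rule reported as undominated has been checked against nonrandomized competitors only; that restriction, and all names in the code, are assumptions of this illustration.

```python
from fractions import Fraction as F
from itertools import product

# Model of Example 8.4.5: f[theta][s] = P_theta({s}); actions coincide with theta.
f = {1: {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)},
     2: {1: F(1, 2), 2: F(1, 2), 3: F(0)}}
S, A = [1, 2, 3], [1, 2]

def risk(d, theta):
    """Risk under 0-1 loss of the nonrandomized rule d (a dict s -> action)."""
    return sum(f[theta][s] for s in S if d[s] != theta)

def strictly_preferred(d0, d):
    """True if d0 is never worse than d and strictly better for some theta."""
    pairs = [(risk(d0, th), risk(d, th)) for th in f]
    return all(r0 <= r for r0, r in pairs) and any(r0 < r for r0, r in pairs)

rules = [dict(zip(S, acts)) for acts in product(A, repeat=len(S))]
undominated = [d for d in rules
               if not any(strictly_preferred(d0, d) for d0 in rules)]

for d in undominated:
    print(d, [risk(d, th) for th in f])
```

Every surviving rule chooses action 1 when s = 3, and several rules with quite different risk functions survive, which illustrates why admissibility alone does not single out a procedure.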

There are several routes out of this impasse for decision theory. One approach is
to use reduction principles. By this we mean that we look for an optimal decision
function in some subclass $D_0 \subset D$ that is considered appropriate. So we then look for
a $\delta_0 \in D_0$ such that $R_{\delta_0}(\theta) \le R_\delta(\theta)$ for every $\theta \in \Omega$ and $\delta \in D_0$, i.e., we look for an
optimal decision function in $D_0$. Consider the following example.


EXAMPLE 8.4.8 Size $\alpha$ Tests for Hypothesis Testing

Consider a hypothesis testing problem $H_0$ versus $H_a$. Recall that in Section 8.2, we
restricted attention to those test functions $\varphi$ that satisfy $E_\theta(\varphi) \le \alpha$ for every $\theta \in H_0$.
Such a $\varphi$ is called a size $\alpha$ test function for this problem. So in this case, we are
restricting to the class $D_0$ of all decision functions $\delta$ for this problem, which correspond
to size $\alpha$ test functions.

In Section 8.2, we showed that sometimes there is an optimal $\delta \in D_0$. For example,
when $H_0$ and $H_a$ are simple, the Neyman-Pearson theorem (Theorem 8.2.1) provides
an optimal $\varphi$; thus, $\delta$, defined by $\delta(s, H_a) = \varphi(s)$, is optimal. We also showed in
Section 8.2, however, that in general there is no optimal size $\alpha$ test function $\varphi$ and so
there is no optimal $\delta \in D_0$. In this case, further reduction principles are necessary. ■


Another approach to selecting a $\delta \in D$ is based on choosing one particular real-valued
characteristic of the risk function of $\delta$ and ordering the decision functions based
on that. There are several possibilities.




One way is to introduce a prior $\pi$ into the problem and then look for the decision
procedure $\delta \in D$ that has smallest prior risk

$$r_\delta = E_\pi(R_\delta(\theta)).$$

We then look for a rule that has prior risk equal to $\min_{\delta \in D} r_\delta$ (or $\inf_{\delta \in D} r_\delta$). This
approach is called Bayesian decision theory.


Definition 8.4.6 The quantity $r_\delta$ is called the prior risk of $\delta$, $\min_{\delta \in D} r_\delta$ is called the
Bayes risk, and a rule with prior risk equal to the Bayes risk is called a Bayes rule.
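
For a small finite problem the prior risk can be minimized by enumeration. The sketch below reuses the model and 0-1 loss of Example 8.4.5 with a uniform prior, which is a hypothetical choice, and searches only over nonrandomized rules; it makes no claim about randomized competitors beyond what the search itself shows.

```python
from fractions import Fraction as F
from itertools import product

# Model of Example 8.4.5 and a hypothetical uniform prior on {1, 2}.
f = {1: {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)},
     2: {1: F(1, 2), 2: F(1, 2), 3: F(0)}}
prior = {1: F(1, 2), 2: F(1, 2)}
S, A = [1, 2, 3], [1, 2]

def risk(d, theta):
    """Risk under 0-1 loss of the nonrandomized rule d (a dict s -> action)."""
    return sum(f[theta][s] for s in S if d[s] != theta)

def prior_risk(d):
    """r_d = E_pi(R_d(theta)), the risk averaged over the prior."""
    return sum(prior[th] * risk(d, th) for th in prior)

rules = [dict(zip(S, acts)) for acts in product(A, repeat=len(S))]
best_rule = min(rules, key=prior_risk)
print(best_rule, prior_risk(best_rule))
```

The minimizing rule picks, for each s, the parameter value with the larger posterior probability, in line with the Bayes rules derived in Section 8.3.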


We derived Bayes rules for several problems in Section 8.3. Interestingly, Bayesian 
decision theory always effectively produces an answer to a decision problem. This is a 
very desirable property for any theory of statistics. 

Another way to order decision functions uses the maximum (or supremum) risk.
So for a decision function $\delta$, we calculate

$$\max_{\theta \in \Omega} R_\delta(\theta)$$

(or $\sup_{\theta \in \Omega} R_\delta(\theta)$) and then select a $\delta \in D$ that minimizes this quantity. Such a $\delta$ has
the smallest, largest risk or the smallest, worst behavior.


Definition 8.4.7 A decision function $\delta_0$ satisfying

$$\max_{\theta \in \Omega} R_{\delta_0}(\theta) = \min_{\delta \in D} \max_{\theta \in \Omega} R_\delta(\theta)$$

is called a minimax decision function.


Again, this approach will always effectively produce an answer to a decision problem 
(see Problem 8.4.10). 
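
The maximum-risk criterion also shows why randomized rules matter. Using the model and 0-1 loss of Example 8.4.5, the sketch below finds the smallest maximum risk among nonrandomized rules and compares it with one particular randomized rule; that rule and all names are chosen here for illustration, and the code only demonstrates that it beats every nonrandomized rule.

```python
from fractions import Fraction as F
from itertools import product

f = {1: {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)},
     2: {1: F(1, 2), 2: F(1, 2), 3: F(0)}}
S, A = [1, 2, 3], [1, 2]

def risk(delta, theta):
    """Risk under 0-1 loss; delta[s][a] is the probability of action a given s."""
    return sum(f[theta][s] * sum(p for a, p in delta[s].items() if a != theta)
               for s in S)

def max_risk(delta):
    return max(risk(delta, th) for th in f)

def degenerate(d):
    """Represent the nonrandomized rule d as a decision function delta."""
    return {s: {a: F(int(a == d[s])) for a in A} for s in S}

nonrandomized = [dict(zip(S, acts)) for acts in product(A, repeat=len(S))]
best_nonrandomized = min(max_risk(degenerate(d)) for d in nonrandomized)

# Illustrative randomized rule: choose a = 1 with probability 2/5 when s is 1 or 2.
randomized = {1: {1: F(2, 5), 2: F(3, 5)},
              2: {1: F(2, 5), 2: F(3, 5)},
              3: {1: F(1), 2: F(0)}}

print(best_nonrandomized, max_risk(randomized))
```

The randomized rule has the same risk at both parameter values, and that common value is smaller than the best maximum risk achievable without randomization.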

Much more can be said about decision theory than this brief introduction to the 
basic concepts. Many interesting, general results have been established for the decision 
theoretic approach to statistical inference. 


Summary of Section 8.4 


• The decision theoretic approach to statistical inference introduces an action space
$\mathcal{A}$ and a loss function $L$.


• A decision function $\delta$ prescribes a probability distribution $\delta(s, \cdot)$ on $\mathcal{A}$. The
statistician generates a decision in $\mathcal{A}$ using this distribution after observing $s$.


• The problem in decision theory is to select $\delta$; for this, the risk function $R_\delta(\theta)$
is used. The value $R_\delta(\theta)$ is the average loss incurred when using the decision
function $\delta$, and the goal is to minimize risk.


• Typically, no optimal decision function $\delta$ exists. So, to select a $\delta$, various
reduction criteria are used to reduce the class of possible decision functions, or the
decision functions are ordered using some real-valued characteristic of their risk
functions, e.g., maximum risk or average risk with respect to some prior.




EXERCISES 


8.4.1 Suppose we observe a sample $(x_1, \ldots, x_n)$ from a Bernoulli($\theta$) distribution,
where $\theta$ is completely unknown, and we want to estimate $\theta$ using squared error loss.
Write out all the ingredients of this decision problem. Calculate the risk function of the
estimator $T(x_1, \ldots, x_n) = \bar{x}$. Graph the risk function when $n = 10$.

8.4.2 Suppose we have a sample $(x_1, \ldots, x_n)$ from a Poisson($\lambda$) distribution, where $\lambda$
is completely unknown, and we want to estimate $\lambda$ using squared error loss. Write out
all the ingredients of this decision problem. Consider the estimator $T(x_1, \ldots, x_n) = \bar{x}$
and calculate its risk function. Graph the risk function when $n = 25$.

8.4.3 Suppose we have a sample $(x_1, \ldots, x_n)$ from an $N(\mu, \sigma_0^2)$ distribution, where $\mu$
is unknown and $\sigma_0^2$ is known, and we want to estimate $\mu$ using squared error loss. Write
out all the ingredients of this decision problem. Consider the estimator $T(x_1, \ldots, x_n) = \bar{x}$
and calculate its risk function. Graph the risk function when $n = 25$, $\sigma_0^2 = 2$.

8.4.4 Suppose we observe a sample $(x_1, \ldots, x_n)$ from a Bernoulli($\theta$) distribution,
where $\theta$ is completely unknown, and we want to test the null hypothesis that $\theta = 1/2$
versus the alternative that it is not equal to this quantity, and we use 0-1 loss. Write
out all the ingredients of this decision problem. Suppose we reject the null hypothesis
whenever we observe $n\bar{x} \in \{0, 1, n - 1, n\}$. Determine the form of the test function
and its associated power function. Graph the power function when $n = 10$.

8.4.5 Consider the decision problem with sample space $S = \{1, 2, 3, 4\}$, parameter
space $\Omega = \{a, b\}$, with the parameter indexing the distributions given in the following
table.


         | s = 1   s = 2   s = 3   s = 4
$f_a(s)$ |  1/4     1/4      0      1/2
$f_b(s)$ |  1/2      0      1/4     1/4

Suppose that the action space $\mathcal{A} = \Omega$, with $A(\theta) = \theta$, and the loss function is given
by $L(\theta, a) = 1$ when $a \ne A(\theta)$ and is equal to 0 otherwise.
(a) Calculate the risk function of the deterministic decision function given by $d(1) =
d(2) = d(3) = a$ and $d(4) = b$.
(b) Is $d$ in part (a) optimal?


COMPUTER EXERCISES 


8.4.6 Suppose we have a sample $(x_1, \ldots, x_n)$ from a Poisson($\lambda$) distribution, where
$\lambda$ is completely unknown, and we want to test the hypothesis that $\lambda \le \lambda_0$ versus the
alternative that $\lambda > \lambda_0$, using the 0-1 loss function. Write out all the ingredients
of this decision problem. Suppose we decide to reject the null hypothesis whenever
$n\bar{x} > \lfloor n\lambda_0 + 2\sqrt{n\lambda_0} \rfloor$ and randomly reject the null hypothesis with probability 1/2
when $n\bar{x} = \lfloor n\lambda_0 + 2\sqrt{n\lambda_0} \rfloor$. Determine the form of the test function and its associated
power function. Graph the power function when $\lambda_0 = 1$ and $n = 5$.

8.4.7 Suppose we have a sample $(x_1, \ldots, x_n)$ from an $N(\mu, \sigma_0^2)$ distribution, where
$\mu$ is unknown and $\sigma_0^2$ is known, and we want to test the null hypothesis that the mean
response is $\mu_0$ versus the alternative that the mean response is not equal to $\mu_0$, using
the 0-1 loss function. Write out all the ingredients of this decision problem. Suppose
that we decide to reject whenever $\bar{x} \notin [\mu_0 - 2\sigma_0/\sqrt{n},\ \mu_0 + 2\sigma_0/\sqrt{n}]$. Determine the
form of the test function and its associated power function. Graph the power function
when $\mu_0 = 0$, $\sigma_0 = 3$, and $n = 10$.


PROBLEMS 


8.4.8 Prove that a decision function $\delta$ that gives a probability measure $\delta(s, \cdot)$ degenerate
at $d(s)$ for each $s \in S$ is equivalent to specifying a function $d : S \to \mathcal{A}$ and
conversely. For such a $\delta$, prove that $R_\delta(\theta) = E_\theta(L(\theta, d(s)))$.

8.4.9 Suppose we have a decision problem and that each probability distribution in the
model is discrete.

(a) Prove that $\delta$ is optimal in $D$ if and only if $\delta(s, \cdot)$ is degenerate at $A(\theta)$ for each $s$
for which $P_\theta(\{s\}) > 0$.

(b) Prove that if there exist $\theta_1, \theta_2 \in \Omega$ such that $A(\theta_1) \ne A(\theta_2)$, and $P_{\theta_1}$, $P_{\theta_2}$ are not
concentrated on disjoint sets, then there is no optimal $\delta \in D$.

8.4.10 If decision function $\delta$ has constant risk and is admissible, then prove that $\delta$ is
minimax.


CHALLENGES 


8.4.11 Suppose we have a decision problem in which $\theta_0 \in \Omega$ is such that $P_{\theta_0}(C) = 0$
implies that $P_\theta(C) = 0$ for every $\theta \in \Omega$. Further assume that there is no optimal
decision function (see Problem 8.4.9). Then prove that the nonrandomized decision
function $d$ given by $d(s) = A(\theta_0)$ is admissible. What does this result tell you about
the concept of admissibility?


DISCUSSION TOPICS 


8.4.12 Comment on the following statement: A natural requirement for any theory of 
inference is that it produce an answer for every inference problem posed. Have we 
discussed any theories so far that you believe will satisfy this? 


8.4.13 Decision theory produces a decision in a given problem. It says nothing about 
how likely it is that the decision is in error. Some statisticians argue that a valid ap- 
proach to inference must include some quantification of our uncertainty concerning any 
statement we make about an unknown, as only then can a recipient judge the reliability 
of the inference. Comment on this. 


8.5 Further Proofs (Advanced)

Proof of Theorem 8.1.2


We want to show that a statistic $U$ is sufficient for a model if and only if the conditional
distribution of the data $s$ given $U = u$ is the same for every $\theta \in \Omega$.

We prove this in the discrete case so that $f_\theta(s) = P_\theta(\{s\})$. The general case requires
more mathematics, and we leave that to a further course.


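Let $u$ be such that $P_\theta(U^{-1}\{u\}) > 0$ where $U^{-1}\{u\} = \{s : U(s) = u\}$, so $U^{-1}\{u\}$
is the set of values of $s$ such that $U(s) = u$. We have

$$P_\theta(s = s_1 \mid U = u) = \frac{P_\theta(s = s_1, U = u)}{P_\theta(U = u)}. \qquad (8.5.1)$$

Whenever $s_1 \notin U^{-1}\{u\}$,

$$P_\theta(s = s_1, U = u) = P_\theta(\{s_1\} \cap \{s : U(s) = u\}) = P_\theta(\emptyset) = 0$$

independently of $\theta$. Therefore, $P_\theta(s = s_1 \mid U = u) = 0$ independently of $\theta$.
So let us suppose that $s_1 \in U^{-1}\{u\}$. Then

$$P_\theta(s = s_1, U = u) = P_\theta(\{s_1\} \cap \{s : U(s) = u\}) = P_\theta(\{s_1\}) = f_\theta(s_1).$$

If $U$ is a sufficient statistic, the factorization theorem (Theorem 6.1.1) implies $f_\theta(s) =
h(s) g_\theta(U(s))$ for some $h$ and $g$. Therefore, since

$$P_\theta(U = u) = \sum_{s \in U^{-1}\{u\}} f_\theta(s),$$

(8.5.1) equals

$$\frac{f_\theta(s_1)}{\sum_{s \in U^{-1}\{u\}} f_\theta(s)} = \frac{f_\theta(s_1)}{\sum_{s \in U^{-1}\{u\}} c(s, s_1)\, f_\theta(s_1)} = \frac{1}{\sum_{s \in U^{-1}\{u\}} c(s, s_1)},$$

where

$$\frac{f_\theta(s)}{f_\theta(s_1)} = \frac{h(s)}{h(s_1)} = c(s, s_1).$$

We conclude that (8.5.1) is independent of $\theta$.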
Conversely, if (8.5.1) is independent of 8, then for s1, s2 € U~!{u} we have 


Po(s = 52) 


= c(s, s1). 


P(U =u) = eee 
Thus 
HE = Bo =s) = BG =s |U = Pe =u) 
= Pls = 51/0 =p a 
7 a fol = ¢(51, $2) fo ($2), 
where 


Po(s = 5, |U =u) 
Po(s =s2|U =u) 
By the definition of sufficiency in Section 6.1.1, this establishes the sufficiency of U. B 


c(s1, 52) = 
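As a small numerical illustration of Theorem 8.1.2 (it is not part of the proof), the following Python sketch computes the conditional distribution of the data given a statistic exactly, for two parameter values. The model (a Bernoulli sample of size 3), the statistic U(s) = s_1 + s_2 + s_3, and the function name `conditional_given_sum` are assumptions chosen here for illustration only.

```python
from fractions import Fraction
from itertools import product

def conditional_given_sum(theta, n=3):
    """Return {u: {s: P_theta(s | U = u)}} for a Bernoulli(theta) sample of
    size n with U(s) = sum(s); theta is a Fraction strictly between 0 and 1."""
    joint = {}
    for s in product((0, 1), repeat=n):
        k = sum(s)
        joint[s] = theta**k * (1 - theta)**(n - k)    # f_theta(s)
    cond = {}
    for u in range(n + 1):
        fibre = [s for s in joint if sum(s) == u]     # the set U^{-1}{u}
        p_u = sum(joint[s] for s in fibre)            # P_theta(U = u)
        cond[u] = {s: joint[s] / p_u for s in fibre}  # equation (8.5.1)
    return cond

c1 = conditional_given_sum(Fraction(1, 4))
c2 = conditional_given_sum(Fraction(2, 3))
same = (c1 == c2)  # True when the conditionals agree for both theta values
```

Exact rational arithmetic is used so that the comparison of the two conditional distributions is not affected by rounding.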




Establishing the Completeness of x̄ in Example 8.1.3


Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in R^1$
is unknown and $\sigma_0^2 > 0$ is known. In Example 6.1.7, we showed that $\bar{x}$ is a minimal
sufficient statistic.
Suppose that the function $h$ is such that $E_\mu(h(\bar{x})) = 0$ for every $\mu \in R^1$. Then
defining

$$h^+(\bar{x}) = \max(0, h(\bar{x})) \quad \text{and} \quad h^-(\bar{x}) = \max(0, -h(\bar{x})),$$

we have $h(\bar{x}) = h^+(\bar{x}) - h^-(\bar{x})$. Therefore, setting

$$c^+(\mu) = E_\mu(h^+(\bar{X})) \quad \text{and} \quad c^-(\mu) = E_\mu(h^-(\bar{X})),$$

we must have

$$E_\mu(h(\bar{X})) = E_\mu(h^+(\bar{X})) - E_\mu(h^-(\bar{X})) = c^+(\mu) - c^-(\mu) = 0,$$

and so $c^+(\mu) = c^-(\mu)$. Because $h^+$ and $h^-$ are nonnegative functions, we have that
$c^+(\mu) \geq 0$ and $c^-(\mu) \geq 0$.

If $c^+(\mu) = 0$, then we have that $h^+(\bar{x}) = 0$ with probability 1, because a nonnegative
function has mean 0 if and only if it is 0 with probability 1 (see Challenge
3.3.22). Then $h^-(\bar{x}) = 0$ with probability 1 also, and we conclude that $h(\bar{x}) = 0$ with
probability 1.

If $c^+(\mu_0) > 0$, then $h^+(\bar{x}) > 0$ for all $\bar{x}$ in a set $A$ having positive probability
with respect to the $N(\mu_0, \sigma_0^2/n)$ distribution (otherwise $h^+(\bar{x}) = 0$ with probability 1,
which implies, as above, that $c^+(\mu_0) = 0$). This implies that $c^+(\mu) > 0$ for every $\mu$
because every $N(\mu, \sigma_0^2/n)$ distribution assigns positive probability to $A$ as well (you
can think of $A$ as a subinterval of $R^1$).

Now note that

$$g^+(\bar{x}) = h^+(\bar{x})\, \frac{\sqrt{n}}{\sqrt{2\pi}\,\sigma_0} \exp(-n\bar{x}^2/2\sigma_0^2)$$

is nonnegative and is strictly positive on $A$. We can write

$$\begin{aligned}
c^+(\mu) = E_\mu(h^+(\bar{X})) &= \int_{-\infty}^{\infty} h^+(\bar{x})\, \frac{\sqrt{n}}{\sqrt{2\pi}\,\sigma_0} \exp(-n(\bar{x} - \mu)^2/2\sigma_0^2)\, d\bar{x} \\
&= \exp(-n\mu^2/2\sigma_0^2) \int_{-\infty}^{\infty} \exp(n\mu\bar{x}/\sigma_0^2)\, g^+(\bar{x})\, d\bar{x}.
\end{aligned} \tag{8.5.2}$$

Setting $\mu = 0$ establishes that $0 < \int_{-\infty}^{\infty} g^+(\bar{x})\, d\bar{x} < \infty$, because $0 < c^+(\mu) < \infty$ for
every $\mu$. Therefore,

$$\frac{g^+(\bar{x})}{\int_{-\infty}^{\infty} g^+(\bar{x})\, d\bar{x}}$$

is a probability density of a distribution concentrated on $A^+ = \{\bar{x} : h(\bar{x}) > 0\}$. Furthermore,
using (8.5.2) and the definition of moment-generating function in Section
3.4,

$$\frac{c^+(\mu) \exp(n\mu^2/2\sigma_0^2)}{\int_{-\infty}^{\infty} g^+(\bar{x})\, d\bar{x}} \tag{8.5.3}$$

is the moment-generating function of this distribution evaluated at $n\mu/\sigma_0^2$.
Similarly, we define

$$g^-(\bar{x}) = h^-(\bar{x})\, \frac{\sqrt{n}}{\sqrt{2\pi}\,\sigma_0} \exp(-n\bar{x}^2/2\sigma_0^2)$$

so that

$$\frac{g^-(\bar{x})}{\int_{-\infty}^{\infty} g^-(\bar{x})\, d\bar{x}}$$

is a probability density of a distribution concentrated on $A^- = \{\bar{x} : h(\bar{x}) < 0\}$. Also,

$$\frac{c^-(\mu) \exp(n\mu^2/2\sigma_0^2)}{\int_{-\infty}^{\infty} g^-(\bar{x})\, d\bar{x}} \tag{8.5.4}$$

is the moment-generating function of this distribution evaluated at $n\mu/\sigma_0^2$.
Because $c^+(\mu) = c^-(\mu)$, we have that (setting $\mu = 0$)

$$\int_{-\infty}^{\infty} g^+(\bar{x})\, d\bar{x} = \int_{-\infty}^{\infty} g^-(\bar{x})\, d\bar{x}.$$

This implies that (8.5.3) equals (8.5.4) for every $\mu$, and so the moment-generating
functions of these two distributions are the same everywhere. By Theorem 3.4.6, these
distributions must be the same. But this is impossible, as the distribution given by $g^+$
is concentrated on $A^+$ whereas the distribution given by $g^-$ is concentrated on $A^-$, and
$A^+ \cap A^- = \emptyset$. Accordingly, we conclude that we cannot have $c^+(\mu) > 0$, and we are
done.


The Proof of Theorem 8.2.1 (the Neyman—Pearson Theorem) 


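We want to prove that when $\Omega = \{\theta_0, \theta_1\}$, and we want to test $H_0 : \theta = \theta_0$, then an
exact size $\alpha$ test function $\varphi_0$ exists of the form

$$\varphi_0(s) = \begin{cases} 1 & f_{\theta_1}(s)/f_{\theta_0}(s) > c_0 \\ \gamma & f_{\theta_1}(s)/f_{\theta_0}(s) = c_0 \\ 0 & f_{\theta_1}(s)/f_{\theta_0}(s) < c_0 \end{cases} \tag{8.5.5}$$

for some $\gamma \in [0, 1]$ and $c_0 \geq 0$, and this test is UMP size $\alpha$.

We develop the proof of this result in the discrete case. The proof in the more
general context is similar.

First, we note that $\{s : f_{\theta_0}(s) = f_{\theta_1}(s) = 0\}$ has $P_\theta$ measure equal to 0 for
both $\theta = \theta_0$ and $\theta = \theta_1$. Accordingly, without loss of generality we can remove this set from the
sample space and assume hereafter that $f_{\theta_0}(s)$ and $f_{\theta_1}(s)$ cannot be simultaneously 0.
Therefore, the ratio $f_{\theta_1}(s)/f_{\theta_0}(s)$ is always defined.

Suppose that $\alpha = 1$. Then setting $c_0 = 0$ and $\gamma = 1$ in (8.5.5), we see that $\varphi_0(s) \equiv 1$,
and so $E_{\theta_1}(\varphi_0) = 1$. Therefore, $\varphi_0$ is UMP size $\alpha$, because no test can have power
greater than 1.

Suppose that $\alpha = 0$. Setting $c_0 = \infty$ and $\gamma = 1$ in (8.5.5), we see that $\varphi_0(s) = 0$ if
and only if $f_{\theta_0}(s) > 0$ (if $f_{\theta_0}(s) = 0$, then $f_{\theta_1}(s)/f_{\theta_0}(s) = \infty$ and conversely). So $\varphi_0$
is the indicator function for the set $A = \{s : f_{\theta_0}(s) = 0\}$, and therefore $E_{\theta_0}(\varphi_0) = 0$.
Further, any size 0 test function $\varphi$ must be 0 on $A^c$ to have $E_{\theta_0}(\varphi) = 0$. On $A$ we have
that $0 \leq \varphi(s) \leq 1 = \varphi_0(s)$ and so $E_{\theta_1}(\varphi) \leq E_{\theta_1}(\varphi_0)$. Therefore, $\varphi_0$ is UMP size $\alpha$.

Now assume that $0 < \alpha < 1$. Consider the distribution function of the likelihood
ratio when $\theta = \theta_0$, namely,

$$1 - \alpha^*(c) = P_{\theta_0}(f_{\theta_1}(s)/f_{\theta_0}(s) \leq c).$$

So $1 - \alpha^*(c)$ is a nondecreasing function of $c$ with $1 - \alpha^*(-\infty) = 0$ and $1 - \alpha^*(\infty) = 1$.

Let $c_0$ be the smallest value of $c$ such that $1 - \alpha \leq 1 - \alpha^*(c)$ (recall that $1 - \alpha^*(c)$
is right continuous because it is a distribution function). Then we have that
$1 - \alpha^*(c_0 - 0) = 1 - \lim_{\epsilon \searrow 0} \alpha^*(c_0 - \epsilon) \leq 1 - \alpha \leq 1 - \alpha^*(c_0)$ and (using the fact that the jump
in a distribution function at a point equals the probability of the point)

$$\begin{aligned}
P_{\theta_0}(f_{\theta_1}(s)/f_{\theta_0}(s) = c_0) &= (1 - \alpha^*(c_0)) - (1 - \alpha^*(c_0 - 0)) \\
&= \alpha^*(c_0 - 0) - \alpha^*(c_0).
\end{aligned}$$

Using this value of $c_0$ in (8.5.5), put

$$\gamma = \begin{cases} \dfrac{\alpha - \alpha^*(c_0)}{\alpha^*(c_0 - 0) - \alpha^*(c_0)} & \alpha^*(c_0 - 0) \neq \alpha^*(c_0) \\ 0 & \text{otherwise,} \end{cases}$$

and note that $\gamma \in [0, 1]$. Then we have

$$\begin{aligned}
E_{\theta_0}(\varphi_0) &= \gamma P_{\theta_0}(f_{\theta_1}(s)/f_{\theta_0}(s) = c_0) + P_{\theta_0}(f_{\theta_1}(s)/f_{\theta_0}(s) > c_0) \\
&= \alpha - \alpha^*(c_0) + \alpha^*(c_0) = \alpha,
\end{aligned}$$

so $\varphi_0$ has exact size $\alpha$.
Now suppose that $\varphi$ is another size $\alpha$ test and $E_{\theta_1}(\varphi) \geq E_{\theta_1}(\varphi_0)$. We partition the
sample space as $S = S_0 \cup S_1 \cup S_2$ where

$$\begin{aligned}
S_0 &= \{s : \varphi_0(s) - \varphi(s) = 0\}, \\
S_1 &= \{s : \varphi_0(s) - \varphi(s) < 0\}, \\
S_2 &= \{s : \varphi_0(s) - \varphi(s) > 0\}.
\end{aligned}$$

Note that

$$S_1 = \{s : \varphi_0(s) - \varphi(s) < 0,\ f_{\theta_1}(s)/f_{\theta_0}(s) \leq c_0\}$$

because $f_{\theta_1}(s)/f_{\theta_0}(s) > c_0$ implies $\varphi_0(s) = 1$, which implies $\varphi_0(s) - \varphi(s) = 1 - \varphi(s) \geq 0$
as $0 \leq \varphi(s) \leq 1$. Also

$$S_2 = \{s : \varphi_0(s) - \varphi(s) > 0,\ f_{\theta_1}(s)/f_{\theta_0}(s) \geq c_0\}$$

because $f_{\theta_1}(s)/f_{\theta_0}(s) < c_0$ implies $\varphi_0(s) = 0$, which implies $\varphi_0(s) - \varphi(s) = -\varphi(s) \leq 0$
as $0 \leq \varphi(s) \leq 1$.

Therefore,

$$\begin{aligned}
0 &\geq E_{\theta_1}(\varphi_0) - E_{\theta_1}(\varphi) = E_{\theta_1}(\varphi_0 - \varphi) \\
&= E_{\theta_1}(I_{S_1}(s)(\varphi_0(s) - \varphi(s))) + E_{\theta_1}(I_{S_2}(s)(\varphi_0(s) - \varphi(s))).
\end{aligned}$$

Now note that

$$\begin{aligned}
E_{\theta_1}(I_{S_1}(s)(\varphi_0(s) - \varphi(s))) &= \sum_{s \in S_1} (\varphi_0(s) - \varphi(s)) f_{\theta_1}(s) \\
&\geq c_0 \sum_{s \in S_1} (\varphi_0(s) - \varphi(s)) f_{\theta_0}(s) = c_0 E_{\theta_0}(I_{S_1}(s)(\varphi_0(s) - \varphi(s)))
\end{aligned}$$

because $\varphi_0(s) - \varphi(s) < 0$ and $f_{\theta_1}(s)/f_{\theta_0}(s) \leq c_0$ when $s \in S_1$. Similarly, we have
that

$$\begin{aligned}
E_{\theta_1}(I_{S_2}(s)(\varphi_0(s) - \varphi(s))) &= \sum_{s \in S_2} (\varphi_0(s) - \varphi(s)) f_{\theta_1}(s) \\
&\geq c_0 \sum_{s \in S_2} (\varphi_0(s) - \varphi(s)) f_{\theta_0}(s) = c_0 E_{\theta_0}(I_{S_2}(s)(\varphi_0(s) - \varphi(s)))
\end{aligned}$$

because $\varphi_0(s) - \varphi(s) > 0$ and $f_{\theta_1}(s)/f_{\theta_0}(s) \geq c_0$ when $s \in S_2$.
Combining these inequalities, we obtain

$$\begin{aligned}
0 &\geq E_{\theta_1}(\varphi_0) - E_{\theta_1}(\varphi) \geq c_0 E_{\theta_0}(\varphi_0 - \varphi) \\
&= c_0 (E_{\theta_0}(\varphi_0) - E_{\theta_0}(\varphi)) = c_0 (\alpha - E_{\theta_0}(\varphi)) \geq 0
\end{aligned}$$

because $E_{\theta_0}(\varphi) \leq \alpha$. Therefore, $E_{\theta_1}(\varphi_0) = E_{\theta_1}(\varphi)$, which proves that $\varphi_0$ is UMP
among all size $\alpha$ tests. ∎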
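The construction of $c_0$ and $\gamma$ in the case $0 < \alpha < 1$ can be traced numerically. The sketch below is only an illustration under assumptions made here: a finite sample space, a Binomial(10, θ) model with θ_0 = 1/2 and θ_1 = 7/10, α = 1/20, and helper names (`binom_pmf`, `neyman_pearson`) that do not come from the text. Exact fractions are used so that the size of the test can be compared with α without rounding.

```python
from fractions import Fraction
from math import comb

def binom_pmf(n, theta):
    """Probability function of Binomial(n, theta) as a list indexed by x."""
    return [comb(n, x) * theta**x * (1 - theta)**(n - x) for x in range(n + 1)]

def neyman_pearson(f0, f1, alpha):
    """Return (c0, gamma) for the test (8.5.5) on a finite sample space.
    Assumes f0[s] > 0 for every s and 0 < alpha < 1."""
    ratio = [b / a for a, b in zip(f0, f1)]
    def alpha_star(c):                        # alpha*(c) = P_theta0(ratio > c)
        return sum(p for p, r in zip(f0, ratio) if r > c)
    # Smallest c with alpha*(c) <= alpha; it is attained at an observed ratio.
    c0 = min(r for r in set(ratio) if alpha_star(r) <= alpha)
    jump = sum(p for p, r in zip(f0, ratio) if r == c0)  # P_theta0(ratio = c0)
    gamma = (alpha - alpha_star(c0)) / jump if jump > 0 else 0
    return c0, gamma

f0 = binom_pmf(10, Fraction(1, 2))
f1 = binom_pmf(10, Fraction(7, 10))
alpha = Fraction(1, 20)
c0, gamma = neyman_pearson(f0, f1, alpha)
ratio = [b / a for a, b in zip(f0, f1)]
phi0 = [1 if r > c0 else gamma if r == c0 else 0 for r in ratio]
size = sum(p * t for p, t in zip(f0, phi0))    # E_theta0(phi0)
power = sum(p * t for p, t in zip(f1, phi0))   # E_theta1(phi0)
```

Under these assumptions the randomization probability `gamma` is what makes the size equal α exactly, even though no nonrandomized rejection region for this binomial model has probability exactly 1/20 under θ_0.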


Chapter 9 
Model Checking 


CHAPTER OUTLINE 


Section 1 Checking the Sampling Model 
Section 2 Checking for Prior—Data Conflict 
Section 3 The Problem with Multiple Checks 


The statistical inference methods developed in Chapters 6 through 8 all depend on
various assumptions. For example, in Chapter 6 we assumed that the data $s$ were
generated from a distribution in the statistical model $\{P_\theta : \theta \in \Omega\}$. In Chapter 7, we
also assumed that our uncertainty concerning the true value of the model parameter $\theta$
could be described by a prior probability distribution $\Pi$. As such, any inferences drawn
are of questionable validity if these assumptions do not make sense in a particular
application.

In fact, all statistical methodology is based on assumptions or choices made by 
the statistical analyst, and these must be checked if we want to feel confident that 
our inferences are relevant. We refer to the process of checking these assumptions as 
model checking, the topic of this chapter. Obviously, this is of enormous importance 
in applications of statistics, and good statistical practice demands that effective model 
checking be carried out. Methods range from fairly informal graphical methods to 
more elaborate hypothesis assessment, and we will discuss a number of these. 


9.1 | Checking the Sampling Model 


Frequency-based inference methods start with a statistical model $\{f_\theta : \theta \in \Omega\}$ for the
true distribution that generated the data $s$. This means we are assuming that the true
distribution for the observed data is in this set. If this assumption is not true, then
it seems reasonable to question the relevance of any subsequent inferences we make
about $\theta$.

Except in relatively rare circumstances, we can never know categorically that a 
model is correct. The most we can hope for is that we can assess whether or not the 
observed data s could plausibly have arisen from the model. 




If the observed data are surprising for each distribution in the model, then we have 
evidence that the model is incorrect. This leads us to think in terms of computing a 
P-value to check the correctness of the model. Of course, in this situation the null 
hypothesis is that the model is correct; the alternative is that the model could be any of 
the other possible models for the type of data we are dealing with. 

We recall now our discussion of P-values in Chapter 6, where we distinguished 
between practical significance and statistical significance. It was noted that, while a P- 
value may indicate that a null hypothesis is false, in practical terms the deviation from 
the null hypothesis may be so small as to be immaterial for the application. When the 
sample size gets large, it is inevitable that any reasonable approach via P-values will 
detect such a deviation and indicate that the null hypothesis is false. This is also true 
when we are carrying out model checking using P-values. The resolution of this is to 
estimate, in some fashion, the size of the deviation of the model from correctness, and 
so determine whether or not the model will be adequate for the application. Even if 
we ultimately accept the use of the model, it is still valuable to know, however, that we 
have detected evidence of model incorrectness when this is the case. 

One P-value approach to model checking entails specifying a discrepancy statistic 
$D : S \to R^1$ that measures deviations from the model under consideration. Typically,
large values of D are meant to indicate that a deviation has occurred. The actual value 
D(s) is, of course, not necessarily an indication of this. The relevant issue is whether 
or not the observed value D(s) is surprising under the assumption that the model is cor- 
rect. Therefore, we must assess whether or not D(s) lies in a region of low probability 
for the distribution of this quantity when the model is correct. For example, consider 
the density of a potential D statistic plotted in Figure 9.1.1. Here a value D(s) in the 
left tail (near 0), right tail (out past 15), or between the two modes (in the interval from 
about 7 to 9) all would indicate that the model is incorrect, because such values have a 
low probability of occurrence when the model is correct. 




Figure 9.1.1: Plot of a density for a discrepancy statistic D. 




The above discussion places the restriction that, when the model is correct, D must 
have a single distribution, i.e., the distribution cannot depend on $\theta$. For many commonly
used discrepancy statistics, this distribution is unimodal. A value in the right
tail then indicates a lack of fit, or underfitting, by the model (the discrepancies are 
unnaturally large); a value in the left tail then indicates overfitting by the model (the 
discrepancies are unnaturally small). 

There are two general methods available for obtaining a single distribution for the 
computation of P-values. One method requires that D be ancillary. 


Definition 9.1.1 A statistic $D$ whose distribution under the model does not depend
upon $\theta$ is called ancillary, i.e., if $s \sim P_\theta$, then $D(s)$ has the same distribution for
every $\theta \in \Omega$.


If D is ancillary, then it has a single distribution specified by the model. If D(s) is a 
surprising value for this distribution, then we have evidence against the model being 
true. 

It is not the case that any ancillary D will serve as a useful discrepancy statistic. 
For example, if D is a constant, then it is ancillary, but it is obviously not useful for 
model checking. So we have to be careful in choosing D. 

Quite often we can find useful ancillary statistics for a model by looking at resid- 
uals. Loosely speaking, residuals are based on the information in the data that is left 
over after we have fit the model. If we have used all the relevant information in the data 
for fitting, then the residuals should contain no useful information for inference about 
the parameter $\theta$. Example 9.1.1 will illustrate more clearly what we mean by residuals.
Residuals play a major role in model checking. 

The second method works with any discrepancy statistic D. For this, we use the 
conditional distribution of D, given the value of a sufficient statistic T. By Theorem 
8.1.2, this conditional distribution is the same for every value of $\theta$. If $D(s)$ is a surprising
value for this distribution, then we have evidence against the model being true.

Sometimes the two approaches we have just described agree, but not always. Con- 
sider some examples. 


EXAMPLE 9.1.1 Location Normal 
Suppose we assume that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where
$\mu \in R^1$ is unknown and $\sigma_0^2$ is known. We know that $\bar{x}$ is a minimal sufficient statistic
for this problem (see Example 6.1.7). Also, $\bar{x}$ represents the fitting of the model to the
data, as it is the estimate of the unknown parameter value $\mu$.

Now consider

$$r = r(x_1, \ldots, x_n) = (r_1, \ldots, r_n) = (x_1 - \bar{x}, \ldots, x_n - \bar{x})$$

as one possible definition of the residual. Note that we can reconstruct the original data
from the values of $\bar{x}$ and $r$.

It turns out that $R = (X_1 - \bar{X}, \ldots, X_n - \bar{X})$ has a distribution that is independent of
$\mu$, with $E(R_i) = 0$ and $\text{Cov}(R_i, R_j) = \sigma_0^2(\delta_{ij} - 1/n)$ for every $i, j$ ($\delta_{ij} = 1$ when $i = j$
and 0 otherwise). Moreover, $R$ is independent of $\bar{X}$ and $R_i \sim N(0, \sigma_0^2(1 - 1/n))$
(see Problems 9.1.19 and 9.1.20).




Accordingly, we have that $r$ is ancillary and so is any discrepancy statistic $D$ that
depends on the data only through $r$. Furthermore, the conditional distribution of $D(R)$
given $\bar{X} = \bar{x}$ is the same as the marginal distribution of $D(R)$ because they are independent.
Therefore, the two approaches to obtaining a P-value agree here, whenever
the discrepancy statistic depends on the data only through $r$.

By Theorem 4.6.6, we have that

$$D(R) = \frac{1}{\sigma_0^2} \sum_{i=1}^{n} R_i^2 = \frac{1}{\sigma_0^2} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

is distributed $\chi^2(n - 1)$, so this is a possible discrepancy statistic. Therefore, the P-value

$$P(D > D(r)), \tag{9.1.1}$$

where $D \sim \chi^2(n - 1)$, provides an assessment of whether or not the model is correct.

Note that values of (9.1.1) near 0 or near 1 are both evidence against the model, as
both indicate that $D(r)$ is in a region of low probability when assuming the model is
correct. A value near 0 indicates that $D(r)$ is in the right tail, whereas a value near 1
indicates that $D(r)$ is in the left tail.

The necessity of examining the left tail of the distribution of $D(r)$, as well as the
right, is seen as follows. Consider the situation where we are in fact sampling from an
$N(\mu, \sigma^2)$ distribution where $\sigma^2$ is much smaller than $\sigma_0^2$. In this case, we expect $D(r)$
to be a value in the left tail, because $E(D(R)) = (n - 1)\sigma^2/\sigma_0^2$.

There are obviously many other choices that could be made for the D statistic. 
At present, there is not a theory that prescribes one choice over another. One caution 
should be noted, however. The choice of a statistic D cannot be based upon looking at 
the data first. Doing so invalidates the computation of the P-value as described above, 
as then we must condition on the data feature that led us to choose that particular D. ∎
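The P-value (9.1.1) can also be approximated by simulation, which avoids the need for a chi-squared table. The following Python sketch is a minimal illustration under assumptions made here: the function names `discrepancy` and `mc_pvalue`, the seed, and the data vector `x` are hypothetical and do not come from the text.

```python
import random

def discrepancy(x, sigma0):
    """D(r) = (1 / sigma0^2) * sum of (x_i - xbar)^2 for the location normal model."""
    xbar = sum(x) / len(x)
    return sum((xi - xbar) ** 2 for xi in x) / sigma0 ** 2

def mc_pvalue(x, sigma0, n_sim=10_000, seed=0):
    """Monte Carlo estimate of P(D >= D(r)). Samples are drawn from N(0, sigma0^2);
    any other mean gives the same distribution because D depends only on residuals."""
    rng = random.Random(seed)
    d_obs = discrepancy(x, sigma0)
    n = len(x)
    hits = 0
    for _ in range(n_sim):
        sim = [rng.gauss(0.0, sigma0) for _ in range(n)]
        if discrepancy(sim, sigma0) >= d_obs:
            hits += 1
    return hits / n_sim

x = [4.1, 5.3, 3.8, 6.0, 4.7]        # hypothetical observations
p_right = mc_pvalue(x, sigma0=1.0)   # values near 0 or near 1 are both surprising
```

Adding a constant to every observation leaves `discrepancy` unchanged, which is the ancillarity property used above; data that are far less spread out than $\sigma_0$ allows produce an estimate close to 1, i.e., a value in the left tail.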


EXAMPLE 9.1.2 Location-Scale Normal 

Suppose we assume that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where
$(\mu, \sigma^2) \in R^1 \times (0, \infty)$ is unknown. We know that $(\bar{x}, s^2)$ is a minimal sufficient
statistic for this model (Example 6.1.8). Consider

$$r = r(x_1, \ldots, x_n) = \left( \frac{x_1 - \bar{x}}{s}, \ldots, \frac{x_n - \bar{x}}{s} \right)$$

as one possible definition of the residual. Note that we can reconstruct the data from
the values of $(\bar{x}, s^2)$ and $r$.

It turns out that $R$ has a distribution that is independent of $(\mu, \sigma^2)$ (and hence is
ancillary; see Challenge 9.1.28) as well as independent of $(\bar{X}, S)$. So again, the two
approaches to obtaining a P-value agree here, as long as the discrepancy statistic
depends on the data only through $r$.

One possible discrepancy statistic is given by

$$D(r) = -\frac{1}{n} \sum_{i=1}^{n} \ln\left( \frac{r_i^2}{n - 1} \right).$$




To use this statistic for model checking, we need to obtain its distribution when the 
model is correct. Then we compare the observed value D(r) with this distribution, to 
see if it is surprising. 

We can do this via simulation. Because the distribution of $D(R)$ is independent
of $(\mu, \sigma^2)$, we can generate $N$ samples of size $n$ from the $N(0, 1)$ distribution (or
any other normal distribution) and calculate $D(R)$ for each sample. Then we look
at histograms of the simulated values to see if $D(r)$, from the original sample, is a
surprising value, i.e., if it lies in a region of low probability like a left or right tail.

For example, suppose we observed the sample 


—2.08 —0.28 2.01 —1.37 40.08 


obtaining the value $D(r) = 4.93$. Then, simulating $10^4$ values from the distribution
of $D$, under the assumption of model correctness, we obtained the density histogram
given in Figure 9.1.2. See Appendix B for some code used to carry out this simulation.
The value $D(r) = 4.93$ is out in the right tail and thus indicates that the sample is not
from a normal distribution. In fact, only 0.0057 of the simulated values are larger, so
this is definite evidence against the model being correct.


Figure 9.1.2: A density histogram for a simulation of $10^4$ values of $D$ in Example 9.1.2.


Obviously, there are other possible functions of $r$ that we could use for model
checking here. In particular, $D_{\text{skew}}(r) = (n - 1)^{-3/2} \sum_{i=1}^{n} r_i^3$, the skewness statistic,
and $D_{\text{kurtosis}}(r) = (n - 1)^{-2} \sum_{i=1}^{n} r_i^4$, the kurtosis statistic, are commonly used.
The skewness statistic measures the symmetry in the data, while the kurtosis statistic
measures the "peakedness" in the data. As just described, we can simulate the distribution
of these statistics under the normality assumption and then compare the observed
values with these distributions to see if we have any evidence against the model (see
Computer Problem 9.1.27). ∎
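The simulation approach described in this example can be sketched in a few lines of Python. This is not the code of Appendix B; it is a hedged illustration using the skewness and kurtosis statistics, with function names, the seed, and the number of simulations chosen here as assumptions. The sample is the one displayed earlier in the example.

```python
import random
from math import sqrt

def standardized(x):
    """r_i = (x_i - xbar) / s, with s^2 the sample variance (divisor n - 1)."""
    n = len(x)
    xbar = sum(x) / n
    s = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    return [(xi - xbar) / s for xi in x]

def d_skew(x):
    """Skewness statistic (n - 1)^(-3/2) * sum of r_i^3."""
    return (len(x) - 1) ** -1.5 * sum(r ** 3 for r in standardized(x))

def d_kurtosis(x):
    """Kurtosis statistic (n - 1)^(-2) * sum of r_i^4."""
    return (len(x) - 1) ** -2 * sum(r ** 4 for r in standardized(x))

def simulate_null(stat, n, n_sim=10_000, seed=0):
    """Values of stat for n_sim samples of size n from N(0, 1); by ancillarity
    any other normal distribution would give the same distribution."""
    rng = random.Random(seed)
    return [stat([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(n_sim)]

sample = [-2.08, -0.28, 2.01, -1.37, 40.08]   # sample from Example 9.1.2
obs = d_kurtosis(sample)
null = simulate_null(d_kurtosis, len(sample))
frac_larger = sum(d >= obs for d in null) / len(null)   # estimated right-tail area
```

Both statistics depend on the data only through $r$, so rescaling or shifting the sample leaves them unchanged; a histogram of `null` plays the role of Figure 9.1.2 for the chosen statistic.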


The following examples present contexts in which the two approaches to computing 
a P-value for model checking are not the same. 




EXAMPLE 9.1.3 Location-Scale Cauchy 

Suppose we assume that $(x_1, \ldots, x_n)$ is a sample from the distribution given by $\mu + \sigma Z$,
where $Z \sim t(1)$ and $(\mu, \sigma^2) \in R^1 \times (0, \infty)$ is unknown. This time, $(\bar{x}, s^2)$ is
not a minimal sufficient statistic, but the statistic $r$ defined in Example 9.1.2 is still
ancillary (Challenge 9.1.28). We can again simulate values from the distribution of $R$
(just generate samples from the $t(1)$ distribution and compute $r$ for each sample) to
estimate P-values for any discrepancy statistic such as the $D(r)$ statistics discussed in
Example 9.1.2. ∎


EXAMPLE 9.1.4 Fisher's Exact Test 

Suppose we take a sample of $n$ from a population of students and observe the values
$(a_1, b_1), \ldots, (a_n, b_n)$, where $a_i$ is gender ($A = 1$ indicating male, $A = 2$ indicating
female) and $b_i$ is a categorical variable for part-time employment status ($B = 1$ indicating
employed, $B = 2$ indicating unemployed). So each individual is being categorized
into one of four categories, namely,


Category 1, when A = 1, B = 1,
Category 2, when A = 1, B = 2,
Category 3, when A = 2, B = 1,
Category 4, when A = 2, B = 2.

Suppose our model for this situation is that $A$ and $B$ are independent with $P(A = 1) = \alpha_1$
and $P(B = 1) = \beta_1$, where $\alpha_1 \in [0, 1]$ and $\beta_1 \in [0, 1]$ are completely unknown, and
write $\alpha_2 = 1 - \alpha_1$ and $\beta_2 = 1 - \beta_1$. Then letting $X_{ij}$ denote the count for the category
where $A = i$, $B = j$, Example 2.8.5 gives that

$$(X_{11}, X_{12}, X_{21}, X_{22}) \sim \text{Multinomial}(n, \alpha_1\beta_1, \alpha_1\beta_2, \alpha_2\beta_1, \alpha_2\beta_2).$$

As we will see in Chapter 10, this model is equivalent to saying that there is no relationship
between gender and employment status.

Denoting the observed cell counts by $(x_{11}, x_{12}, x_{21}, x_{22})$, the likelihood function is
given by

$$\begin{aligned}
&(\alpha_1\beta_1)^{x_{11}} (\alpha_1\beta_2)^{x_{12}} (\alpha_2\beta_1)^{x_{21}} (\alpha_2\beta_2)^{x_{22}} \\
&\quad = \alpha_1^{x_{11} + x_{12}} (1 - \alpha_1)^{n - x_{11} - x_{12}} \beta_1^{x_{11} + x_{21}} (1 - \beta_1)^{n - x_{11} - x_{21}} \\
&\quad = \alpha_1^{x_{1\cdot}} (1 - \alpha_1)^{n - x_{1\cdot}} \beta_1^{x_{\cdot 1}} (1 - \beta_1)^{n - x_{\cdot 1}},
\end{aligned}$$

where $(x_{1\cdot}, x_{\cdot 1}) = (x_{11} + x_{12}, x_{11} + x_{21})$. Therefore, the MLE (Problem 9.1.14) is
given by

$$(\hat{\alpha}_1, \hat{\beta}_1) = \left( \frac{x_{1\cdot}}{n}, \frac{x_{\cdot 1}}{n} \right).$$

Note that $\hat{\alpha}_1$ is the proportion of males in the sample and $\hat{\beta}_1$ is the proportion of all
employed in the sample. Because $(x_{1\cdot}, x_{\cdot 1})$ determines the likelihood function and can
be calculated from the likelihood function, we have that $(x_{1\cdot}, x_{\cdot 1})$ is a minimal sufficient
statistic.
In this example, a natural definition of residual does not seem readily apparent.
So we consider looking at the conditional distribution of the data, given the minimal
sufficient statistic. The conditional distribution of the sample $(A_1, B_1), \ldots, (A_n, B_n)$,
given the values $(x_{1\cdot}, x_{\cdot 1})$, is the uniform distribution on the set of all samples where
the restrictions

$$\begin{aligned}
x_{11} + x_{12} &= x_{1\cdot}, \\
x_{11} + x_{21} &= x_{\cdot 1}, \\
x_{11} + x_{12} + x_{21} + x_{22} &= n
\end{aligned} \tag{9.1.2}$$

are satisfied. Notice that, given $(x_{1\cdot}, x_{\cdot 1})$, all the other values in (9.1.2) are determined
when we specify a value for $x_{11}$.
It can be shown that the number of such samples is equal to (see Problem 9.1.21)

$$\binom{n}{x_{1\cdot}} \binom{n}{x_{\cdot 1}}.$$

Now the number of samples with prescribed values for $x_{1\cdot}$, $x_{\cdot 1}$, and $x_{11} = i$ is given by

$$\binom{n}{x_{1\cdot}} \binom{x_{1\cdot}}{i} \binom{n - x_{1\cdot}}{x_{\cdot 1} - i}.$$

Therefore, the conditional probability function of $x_{11}$, given $(x_{1\cdot}, x_{\cdot 1})$, is

$$P(x_{11} = i \mid x_{1\cdot}, x_{\cdot 1}) = \frac{\binom{n}{x_{1\cdot}} \binom{x_{1\cdot}}{i} \binom{n - x_{1\cdot}}{x_{\cdot 1} - i}}{\binom{n}{x_{1\cdot}} \binom{n}{x_{\cdot 1}}} = \frac{\binom{x_{1\cdot}}{i} \binom{n - x_{1\cdot}}{x_{\cdot 1} - i}}{\binom{n}{x_{\cdot 1}}}.$$

This is the Hypergeometric$(n, x_{\cdot 1}, x_{1\cdot})$ probability function.

So we have evidence against the model holding whenever $x_{11}$ is out in the tails of
this distribution. Assessing this requires a tabulation of this distribution or the use of a
statistical package with the hypergeometric distribution function built in.

As a simple numerical example, suppose that we took a sample of $n = 20$ students,
obtaining $x_{\cdot 1} = 12$ employed, $x_{1\cdot} = 6$ males, and $x_{11} = 2$ employed males. Then
the Hypergeometric(20, 12, 6) probability function is given by the following table.


  i       0      1      2      3      4      5      6
  p(i)    0.001  0.017  0.119  0.318  0.358  0.163  0.024


The probability of getting a value as far, or farther, out in the tails than $x_{11} = 2$ is equal
to the probability of observing a value of $x_{11}$ with probability of occurrence as small
as or smaller than that of $x_{11} = 2$. This P-value equals


(0.119 + 0.017 + 0.001) + 0.024 = 0.161.


Therefore, we have no evidence against the model of independence between A and B. 
Of course, the sample size is quite small here. 
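The table and the P-value in this numerical example can be reproduced directly from the conditional probability function. The Python sketch below is an illustration only; the function names `conditional_pmf` and `p_value` are chosen here and are not from the text, and the small relative tolerance in the comparison is an assumption added to guard against floating-point ties.

```python
from math import comb

def conditional_pmf(n, row_total, col_total):
    """P(x11 = i | x1. = row_total, x.1 = col_total) for each feasible i."""
    lo = max(0, row_total + col_total - n)
    hi = min(row_total, col_total)
    denom = comb(n, col_total)
    return {i: comb(row_total, i) * comb(n - row_total, col_total - i) / denom
            for i in range(lo, hi + 1)}

def p_value(pmf, observed):
    """Total probability of the outcomes that are no more probable than the
    observed one (both tails are included automatically)."""
    p_obs = pmf[observed]
    return sum(p for p in pmf.values() if p <= p_obs * (1 + 1e-12))

pmf = conditional_pmf(20, 6, 12)   # n = 20, x1. = 6, x.1 = 12
pval = p_value(pmf, 2)             # observed x11 = 2
```

Rounding the entries of `pmf` to three decimals gives the tabulated probabilities, and `pval` rounds to 0.161.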

There is another approach here to testing the independence of A and B. In particu- 
lar, we could only assume the independence of the initial unclassified sample, and then 
we always have 


$$(X_{11}, X_{12}, X_{21}, X_{22}) \sim \text{Multinomial}(n, \alpha_{11}, \alpha_{12}, \alpha_{21}, \alpha_{22}),$$

where the $\alpha_{ij}$ comprise an unknown probability distribution. Given this model, we
could then test for the independence of A and B. We will discuss this in Section 10.2. ∎


Another approach to model checking proceeds as follows. We enlarge the model to 
include more distributions and then test the null hypothesis that the true model is the 
submodel we initially started with. If we can apply the methods of Section 8.2 to come 
up with a uniformly most powerful (UMP) test of this null hypothesis, then we will 
have a check of departures from the model of interest — at least as expressed by the 
possible alternatives in the enlarged model. If the model passes such a check, however, 
we are still required to check the validity of the enlarged model. This can be viewed as 
a technique for generating relevant discrepancy statistics D. 


9.1.1 | Residual and Probability Plots 


There is another, more informal approach to checking model correctness that is often 
used when we have residuals available. These methods involve various plots of the 
residuals that should exhibit specific characteristics if the model is correct. While this 
approach lacks the rigor of the P-value approach, it is good at demonstrating gross 
deviations from model assumptions. We illustrate this via some examples. 


EXAMPLE 9.1.5 Location and Location-Scale Normal Models 

Using the residuals for the location normal model discussed in Example 9.1.1, we have
that $E(R_i) = 0$ and $\text{Var}(R_i) = \sigma_0^2(1 - 1/n)$. We standardize these values so that they
also have variance 1, and so obtain the standardized residuals $(r_1^*, \ldots, r_n^*)$ given by

$$r_i^* = \sqrt{\frac{n}{\sigma_0^2 (n - 1)}}\, (x_i - \bar{x}). \tag{9.1.3}$$

The standardized residuals are distributed $N(0, 1)$, and, assuming that $n$ is reasonably
large, it can be shown that they are approximately independent. Accordingly, we can
think of $r_1^*, \ldots, r_n^*$ as an approximate sample from the $N(0, 1)$ distribution.

Therefore, a plot of the points $(i, r_i^*)$ should not exhibit any discernible pattern.
Furthermore, all the values in the y-direction should lie in (—3, 3), unless of course 
n is very large, in which case we might expect a few values outside this interval. A 
discernible pattern, or several extreme values, can be taken as some evidence that the 
model assumption is not correct. Always keep in mind, however, that any observed 
pattern could have arisen simply from sampling variability when the true model is 
correct. Simulating a few of these residual plots (just generating several samples of n 
from the N(0, 1) distribution and obtaining a residual plot for each sample) will give 
us some idea of whether or not the observed pattern is unusual. 

Figure 9.1.3 shows a plot of the standardized residuals (9.1.3) for a sample of 100
from the $N(0, 1)$ distribution. Figure 9.1.4 shows a plot of the standardized residuals
for a sample of 100 from the distribution given by $3^{-1/2}Z$, where $Z \sim t(3)$. Note that
a $t(3)$ distribution has mean 0 and variance equal to 3, so $\text{Var}(3^{-1/2}Z) = 1$ (Problem
4.6.16). Figure 9.1.5 shows the standardized residuals for a sample of 100 from an
Exponential(1) distribution.


Figure 9.1.3: A plot of the standardized residuals for a sample of 100 from an $N(0, 1)$
distribution.


Figure 9.1.4: A plot of the standardized residuals for a sample of 100 from $X = 3^{-1/2}Z$,
where $Z \sim t(3)$.


Figure 9.1.5: A plot of the standardized residuals for a sample of 100 from an Exponential(1)
distribution.




Note that the distributions of the standardized residuals for all these samples have
mean 0 and variance equal to 1. The difference in Figures 9.1.3 and 9.1.4 is due to the
fact that the $t$ distribution has much longer tails. This is reflected in the fact that a few
of the standardized residuals are outside $(-3, 3)$ in Figure 9.1.4 but not in Figure 9.1.3.
Even though the two distributions are quite different (e.g., the $N(0, 1)$ distribution
has all of its moments whereas the $3^{-1/2}\, t(3)$ distribution has only two moments),
the plots of the standardized residuals are otherwise very similar. The difference in
Figures 9.1.3 and 9.1.5 is due to the asymmetry in the Exponential(1) distribution, as
it is skewed to the right.

Using the residuals for the location-scale normal model discussed in Example 9.1.2,
we define the standardized residuals $r_1^*, \ldots, r_n^*$ by

$$r_i^* = \sqrt{\frac{n}{s^2 (n - 1)}}\, (x_i - \bar{x}). \tag{9.1.4}$$

Here, the unknown variance is estimated by $s^2$. Again, it can be shown that when $n$ is
large, then $(r_1^*, \ldots, r_n^*)$ is an approximate sample from the $N(0, 1)$ distribution. So we
plot the values $(i, r_i^*)$ and interpret the plot just as we described for the location normal
model. ∎
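The suggestion to simulate a few residual plots can be tried with a short script. The following Python sketch is a hedged illustration: it computes standardized residuals as in (9.1.3) or (9.1.4) for samples of 100 from the three distributions used in Figures 9.1.3 to 9.1.5 and counts how many fall outside (−3, 3) instead of drawing the plots. The function names, the seed, and the way the scaled t(3) variate is generated are assumptions made here.

```python
import random
from math import sqrt

def standardized_residuals(x, sigma0=None):
    """Formula (9.1.3) when sigma0 is given; otherwise (9.1.4), with s in place of sigma0."""
    n = len(x)
    xbar = sum(x) / n
    if sigma0 is None:
        scale = sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))   # s
    else:
        scale = sigma0
    factor = sqrt(n / (n - 1)) / scale
    return [factor * (xi - xbar) for xi in x]

def t3_scaled(rng):
    """One draw of 3^(-1/2) * Z with Z ~ t(3), built as N(0,1) / sqrt(chi2(3) / 3)."""
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
    return rng.gauss(0.0, 1.0) / sqrt(chi2 / 3) / sqrt(3)

rng = random.Random(1)
n = 100
samples = {
    "N(0,1)": [rng.gauss(0.0, 1.0) for _ in range(n)],
    "scaled t(3)": [t3_scaled(rng) for _ in range(n)],
    "Exponential(1)": [rng.expovariate(1.0) for _ in range(n)],
}
# Number of standardized residuals outside (-3, 3) for each sample
outside = {name: sum(abs(r) > 3 for r in standardized_residuals(x, sigma0=1.0))
           for name, x in samples.items()}
```

Plotting each list of residuals against its index would give pictures comparable to the three figures; rerunning with different seeds shows how much such plots vary from sample to sample when the model is correct.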


It is very common in statistical applications to assume some basic form for the dis- 
tribution of the data, e.g., we might assume we are sampling from a normal distribution 
with some mean and variance. To assess such an assumption, the use of a probability 
plot has proven to be very useful. 

To illustrate, suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution.
Then it can be shown that when $n$ is large, the expectation of the $i$-th order statistic
satisfies

$$E(X_{(i)}) \approx \mu + \sigma \Phi^{-1}(i/(n + 1)). \tag{9.1.5}$$

If the data value $x_j$ corresponds to order statistic $x_{(i)}$ (i.e., $x_{(i)} = x_j$), then we call
$\Phi^{-1}(i/(n + 1))$ the normal score of $x_j$ in the sample. Then (9.1.5) indicates that if
we plot the points $(x_{(i)}, \Phi^{-1}(i/(n + 1)))$, these should lie approximately on a line;
expressing $x_{(i)}$ as a function of the normal score, this line has intercept $\mu$ and slope $\sigma$.
We call such a plot a normal probability plot or normal
quantile plot. Similar plots can be obtained for other distributions.


EXAMPLE 9.1.6 Location-Scale Normal 
Suppose we want to assess whether or not the following data set can be considered a 
sample of size n = 10 from some normal distribution. 


2.00 0.28 0.47 3.33 1.66 8.17 1.18 4.15 6.43 1.77


The order statistics and associated normal scores for this sample are given in the
following table.


  i                     1      2      3      4      5
  x_(i)                 0.28   0.47   1.18   1.66   1.77
  Φ^{-1}(i/(n+1))      −1.34  −0.91  −0.61  −0.35  −0.12

  i                     6      7      8      9      10
  x_(i)                 2.00   3.33   4.15   6.43   8.17
  Φ^{-1}(i/(n+1))       0.11   0.34   0.60   0.90   1.33


The values $(x_{(i)}, \Phi^{-1}(i/(n + 1)))$ are then plotted in Figure 9.1.6. There is some
definite deviation from a straight line
here, but note that it is difficult to tell whether this is unexpected in a sample of this
size from a normal distribution. Again, simulating a few samples of the same size (say,
from an $N(0, 1)$ distribution) and looking at their normal probability plots is recommended.
In this case, we conclude that the plot in Figure 9.1.6 looks reasonable. ∎
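The normal scores in this example can be computed with the inverse standard normal distribution function. The Python sketch below uses `statistics.NormalDist` from the standard library; the function name `normal_scores` is chosen here for illustration, and the data are entered with the fifth and ninth values read as 1.66 and 6.43.

```python
from statistics import NormalDist

def normal_scores(x):
    """Pair each order statistic x_(i) with its normal score Phi^{-1}(i / (n + 1))."""
    n = len(x)
    inv = NormalDist().inv_cdf          # inverse cdf of N(0, 1)
    return [(xi, inv(i / (n + 1))) for i, xi in enumerate(sorted(x), start=1)]

data = [2.00, 0.28, 0.47, 3.33, 1.66, 8.17, 1.18, 4.15, 6.43, 1.77]
pairs = normal_scores(data)
for value, score in pairs:
    print(f"{value:5.2f}  {score:6.2f}")
```

Plotting the pairs gives the normal probability plot; the printed scores may differ from the tabulated ones in the last decimal place because of rounding.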




Figure 9.1.6: Normal probability plot of the data in Example 9.1.6. 


We will see in Chapter 10 that the use of normal probability plots of standardized 
residuals is an important part of model checking for more complicated models. So, 
while they are not really needed here, we consider some of the characteristics of such 
plots when assessing whether or not a sample is from a location normal or location- 
scale normal model. 

Assume that n is large so that we can consider the standardized residuals, given 
by (9.1.3) or (9.1.4) as an approximate sample from the N (0, 1) distribution. Then a 
normal probability plot of the standardized residuals should be approximately linear, 
with y-intercept approximately equal to 0 and slope approximately equal to 1. If we 
get a substantial deviation from this, then we have evidence that the assumed model is 
incorrect. 

In Figure 9.1.7, we have plotted a normal probability plot of the standardized residuals
for a sample of $n = 25$ from an $N(0, 1)$ distribution. In Figure 9.1.8, we have
plotted a normal probability plot of the standardized residuals for a sample of $n = 25$
from the distribution given by $X = 3^{-1/2}Z$, where $Z \sim t(3)$. Both distributions have
mean 0 and variance 1, so the difference in the normal probability plots is due to other
distributional differences.


Figure 9.1.7: Normal probability plot of the standardized residuals of a sample of 25 from an
N(0, 1) distribution (horizontal axis: standardized residuals; vertical axis: normal scores).


Figure 9.1.8: Normal probability plot of the standardized residuals of a sample of 25 from
$X = 3^{-1/2} Z$, where $Z \sim t(3)$ (horizontal axis: standardized residuals; vertical axis: normal scores).


9.1.2 | The Chi-Squared Goodness of Fit Test 


The chi-squared goodness of fit test has an important historical place in any discussion
of assessing model correctness. We use this test to assess whether or not a categorical
random variable W, which takes its values in the finite sample space {1, 2, ..., k}, has a
specified probability measure P, after having observed a sample $(w_1, \ldots, w_n)$. When
we have a random variable that is discrete and takes infinitely many values, then we
partition the possible values into k categories and let W simply indicate which category
has occurred. If we have a random variable that is quantitative, then we partition $R^1$
into k subintervals and let W indicate in which interval the response occurred. In effect,
we want to check whether or not a specific probability model, as given by P, is correct
for W based on an observed sample.




Let $(X_1, \ldots, X_k)$ be the observed counts or frequencies of 1, ..., k, respectively.
If P is correct, then, from Example 2.8.5,

$$(X_1, \ldots, X_k) \sim \text{Multinomial}(n, p_1, \ldots, p_k),$$

where $p_i = P(\{i\})$. This implies that $E(X_i) = np_i$ and $\text{Var}(X_i) = np_i(1 - p_i)$ (recall
that $X_i \sim \text{Binomial}(n, p_i)$). From this, we deduce that

$$R_i = \frac{X_i - np_i}{\sqrt{np_i(1 - p_i)}} \xrightarrow{D} N(0, 1) \qquad (9.1.6)$$

as $n \to \infty$ (see Example 4.4.9).

For finite n, the distribution of $R_i$, when the model is correct, is dependent on
P, but the limiting distribution is not. Thus we can think of the $R_i$ as standardized
residuals when n is large. Therefore, it would seem that a reasonable discrepancy
statistic is given by the sum of the squares of the standardized residuals, with $\sum_{i=1}^{k} R_i^2$
approximately distributed $\chi^2(k)$. The restriction $x_1 + \cdots + x_k = n$ holds, however, so
the $R_i$ are not independent and the limiting distribution is not $\chi^2(k)$. We do, however,
have the following result, which provides a similar discrepancy statistic.


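Theorem 9.1.1 If $(X_1, \ldots, X_k) \sim \text{Multinomial}(n, p_1, \ldots, p_k)$, then

$$X^2 = \sum_{i=1}^{k} (1 - p_i) R_i^2 = \sum_{i=1}^{k} \frac{(X_i - np_i)^2}{np_i} \xrightarrow{D} \chi^2(k - 1)$$

as $n \to \infty$.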


The proof of this result is a little too involved for this text, so see, for example, Theorem 
17.2 of Asymptotic Statistics by A. W. van der Vaart (Cambridge University Press, 
Cambridge, 1998), which we will use here. 

We refer to $X^2$ as the chi-squared statistic. The process of assessing the correctness
of the model by computing the P-value $P(X^2 > X_0^2)$, where $X^2 \sim \chi^2(k - 1)$ and
$X_0^2$ is the observed value of the chi-squared statistic, is referred to as the chi-squared
goodness of fit test. Small P-values near 0 provide evidence of the incorrectness of the
probability model. Small P-values indicate that some of the residuals are too large.

Note that the ith term of the chi-squared statistic can be written as

$$\frac{(X_i - np_i)^2}{np_i} = \frac{(\text{number in the ith cell} - \text{expected number in the ith cell})^2}{\text{expected number in the ith cell}}.$$


It is recommended, for example, in Statistical Methods, by G. Snedecor and W. Cochran
(Iowa State Press, 6th ed., Ames, 1967) that grouping (combining cells) be employed to
ensure that $E(X_i) = np_i > 1$ for every i, as simulations have shown that this improves
the accuracy of the approximation to the P-value.
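The grouping recommendation can be sketched as follows. This is only one possible grouping rule, assumed here for illustration: adjacent cells are merged from left to right until each expected count exceeds 1, and any leftover cells are folded into the last group. The function name `group_cells` and the counts are hypothetical.

```python
import math

def group_cells(counts, probs, n, min_expected=1.0):
    """Merge adjacent cells, left to right, until every expected count n*p exceeds min_expected."""
    new_counts, new_probs = [], []
    acc_count, acc_prob = 0, 0.0
    for x, p in zip(counts, probs):
        acc_count += x
        acc_prob += p
        if n * acc_prob > min_expected:
            new_counts.append(acc_count)
            new_probs.append(acc_prob)
            acc_count, acc_prob = 0, 0.0
    if new_counts:
        # Fold any leftover cells (expected count still too small) into the last group.
        new_counts[-1] += acc_count
        new_probs[-1] += acc_prob
    return new_counts, new_probs

# Binomial(10, 0.2) cell probabilities for the values 0, 1, ..., 10.
probs = [math.comb(10, j) * 0.2**j * 0.8**(10 - j) for j in range(11)]
counts = [11, 27, 30, 19, 9, 3, 1, 0, 0, 0, 0]   # hypothetical counts, n = 100
n = sum(counts)
g_counts, g_probs = group_cells(counts, probs, n)
print(g_counts)   # the values 5, 6, ..., 10 end up in a single cell
```

The chi-squared statistic is then computed from the grouped counts and grouped probabilities, with k equal to the number of grouped cells.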

We consider an important application. 


EXAMPLE 9.1.7 Testing the Accuracy of a Random Number Generator 
In effect, every Monte Carlo simulation can be considered to be a set of mathematical
operations applied to a stream of numbers $U_1, U_2, \ldots$ in [0, 1] that are supposed to
be i.i.d. Uniform[0, 1]. Of course, they cannot satisfy this requirement exactly because
they are generated according to some deterministic function. Typically, a function
$f : [0, 1]^m \to [0, 1]$ is chosen and is applied iteratively to obtain the sequence. So we
select $U_1, \ldots, U_m$ as initial seed values and then $U_{m+1} = f(U_1, \ldots, U_m)$,
$U_{m+2} = f(U_2, \ldots, U_{m+1})$, etc. There are many possibilities for f, and a great deal of
research and study have gone into selecting functions that will produce sequences that
adequately mimic the properties of an i.i.d. Uniform[0, 1] sequence.

Of course, it is always possible that the underlying f used in a particular statistical 
package or other piece of software is very poor. In such a case, the results of the 
simulations can be grossly in error. How do we assess whether a particular f is good 
or not? One approach is to run a battery of statistical tests to see whether the sequence 
is behaving as we know an ideal sequence would. 

For example, if the sequence $U_1, U_2, \ldots$ is i.i.d. Uniform[0, 1], then

$$\lceil 10U_1 \rceil, \lceil 10U_2 \rceil, \ldots$$

is i.i.d. Uniform{1, 2, ..., 10} ($\lceil x \rceil$ denotes the smallest integer greater than or equal
to x, e.g., $\lceil 3.2 \rceil = 4$). So we can test the adequacy of the underlying function f by generating
$U_1, \ldots, U_n$ for large n, putting $x_i = \lceil 10U_i \rceil$, and then carrying out a chi-squared
goodness of fit test with the 10 categories {1, ..., 10} with each cell probability equal
to 1/10.

Doing this using a popular statistical package (with $n = 10^4$) gave the following
table of counts $x_i$ and standardized residuals $r_i$ as specified in (9.1.6).


  i     x_i      r_i
  1     993    -0.23333
  2    1044     1.46667
  3    1061     2.03333
  4    1021     0.70000
  5    1017     0.56667
  6     973    -0.90000
  7     975    -0.83333
  8     965    -1.16667
  9     996    -0.13333
 10     955    -1.50000


All the standardized residuals look reasonable as possible values from an N(0, 1) distribution.
Furthermore,

$$\begin{aligned}
X_0^2 = (1 - 0.1)\,[\, & (-0.23333)^2 + (1.46667)^2 + (2.03333)^2 \\
& + (0.70000)^2 + (0.56667)^2 + (-0.90000)^2 \\
& + (-0.83333)^2 + (-1.16667)^2 + (-0.13333)^2 \\
& + (-1.50000)^2 \,] = 11.0560
\end{aligned}$$

gives the P-value $P(X^2 > 11.0560) = 0.27190$ when $X^2 \sim \chi^2(9)$. This indicates that
we have no evidence that the random number generator is defective.
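The arithmetic of this example can be checked with a short sketch. The counts below are those consistent with the residuals listed in the example (via $x_i = 1000 + 30 r_i$). The helper `chi2_sf` is a hypothetical name for a small routine that evaluates the chi-squared tail probability for integer degrees of freedom using a standard recurrence, so that only the standard library is needed.

```python
import math

def chi2_sf(x, df):
    """P(X^2 > x) for X^2 ~ chi-squared(df), with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        k, sf = 2, math.exp(-half)              # exact tail for 2 degrees of freedom
    else:
        k, sf = 1, math.erfc(math.sqrt(half))   # exact tail for 1 degree of freedom
    while k < df:
        # Recurrence: Q_{k+2}(x) = Q_k(x) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
        sf += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return sf

counts = [993, 1044, 1061, 1021, 1017, 973, 975, 965, 996, 955]
n, k = sum(counts), len(counts)
p = 1 / k                                        # each cell has probability 1/10

residuals = [(x - n * p) / math.sqrt(n * p * (1 - p)) for x in counts]
x2 = sum((x - n * p) ** 2 / (n * p) for x in counts)
p_value = chi2_sf(x2, k - 1)
print(f"X^2 = {x2:.4f}, P-value = {p_value:.5f}")
```

Note that the statistic is computed here directly from the counts; multiplying the sum of squared residuals by (1 - 0.1) gives the same value.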




Of course, the story does not end with a single test like this. Many other features 
of the sequence should be tested. For example, we might want to investigate the inde- 
pendence properties of the sequence and so test if each possible combination of (i, j) 
occurs with probability 1/100, etc.


More generally, we will not have a prescribed probability distribution P for X but
rather a statistical model $\{P_\theta : \theta \in \Omega\}$, where each $P_\theta$ is a probability measure on the
finite set {1, 2, ..., k}. Then, based on the sample from the model, we have that

$$(X_1, \ldots, X_k) \sim \text{Multinomial}(n, p_1(\theta), \ldots, p_k(\theta)),$$

where $p_i(\theta) = P_\theta(\{i\})$.


Perhaps a natural way to assess whether or not this model fits the data is to find the
MLE $\hat{\theta}$ from the likelihood function

$$L(\theta \mid x_1, \ldots, x_k) = (p_1(\theta))^{x_1} \cdots (p_k(\theta))^{x_k}$$

and then look at the standardized residuals

$$r_i(\hat{\theta}) = \frac{x_i - np_i(\hat{\theta})}{\sqrt{np_i(\hat{\theta})(1 - p_i(\hat{\theta}))}}.$$

We have the following result, which we state without proof.


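Theorem 9.1.2 Under conditions (similar to those discussed in Section 6.5), we
have that $R_i(\hat{\theta}) \xrightarrow{D} N(0, 1)$ and

$$X^2 = \sum_{i=1}^{k} (1 - p_i(\hat{\theta})) R_i^2(\hat{\theta}) = \sum_{i=1}^{k} \frac{(X_i - np_i(\hat{\theta}))^2}{np_i(\hat{\theta})} \xrightarrow{D} \chi^2(k - 1 - \dim \Omega)$$

as $n \to \infty$.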


By $\dim \Omega$, we mean the dimension of the set $\Omega$. Loosely speaking, this is the minimum
number of coordinates required to specify a point in the set, e.g., a line requires
one coordinate (positive or negative distance from a fixed point), a circle requires one
coordinate, a plane in $R^3$ requires two coordinates, etc. Of course, this result implies
that the number of cells must satisfy $k > 1 + \dim \Omega$.

Consider an example. 


EXAMPLE 9.1.8 Testing for Exponentiality 

Suppose that a sample of lifelengths of light bulbs (measured in thousands of hours)
is supposed to be from an Exponential($\theta$) distribution, where $\theta \in \Omega = (0, \infty)$ is
unknown. So here $\dim \Omega = 1$, and we require at least two cells for the chi-squared
test. The manufacturer expects that most bulbs will last at least 1000 hours, 50% will
last less than 2000 hours, and most will have failed by 3000 hours. So based on this,
we partition the sample space as

$$(0, \infty) = (0, 1] \cup (1, 2] \cup (2, 3] \cup (3, \infty).$$

Suppose that a sample of n = 30 light bulbs was taken and that the counts $x_1 = 5$,
$x_2 = 16$, $x_3 = 8$, and $x_4 = 1$ were obtained for the four intervals, respectively. Then
the likelihood function based on these counts is given by


$$L(\theta \mid x_1, \ldots, x_4) = (1 - e^{-\theta})^{5} (e^{-\theta} - e^{-2\theta})^{16} (e^{-2\theta} - e^{-3\theta})^{8} (e^{-3\theta})^{1}$$

because, for example, the probability of a value falling in (1, 2] is $e^{-\theta} - e^{-2\theta}$ and we
have $x_2 = 16$ observations in this interval. Figure 9.1.9 is a plot of the log-likelihood.


Figure 9.1.9: Plot of the log-likelihood function in Example 9.1.8 (horizontal axis: theta; vertical axis: ln L).


By successively plotting the likelihood on shorter and shorter intervals, the MLE
was determined to be $\hat{\theta} = 0.603535$. This value leads to the probabilities

$$\begin{aligned}
p_1(\hat{\theta}) &= 1 - e^{-0.603535} = 0.453125, \\
p_2(\hat{\theta}) &= e^{-0.603535} - e^{-2(0.603535)} = 0.247803, \\
p_3(\hat{\theta}) &= e^{-2(0.603535)} - e^{-3(0.603535)} = 0.135517, \\
p_4(\hat{\theta}) &= e^{-3(0.603535)} = 0.163555,
\end{aligned}$$

the fitted values

$$\begin{aligned}
30p_1(\hat{\theta}) &= 13.59375, \\
30p_2(\hat{\theta}) &= 7.43409, \\
30p_3(\hat{\theta}) &= 4.06551, \\
30p_4(\hat{\theta}) &= 4.90665,
\end{aligned}$$

and the standardized residuals

$$\begin{aligned}
r_1(\hat{\theta}) &= (5 - 13.59375) / \sqrt{30 (0.453125) (1 - 0.453125)} = -3.151875, \\
r_2(\hat{\theta}) &= (16 - 7.43409) / \sqrt{30 (0.247803) (1 - 0.247803)} = 3.622378, \\
r_3(\hat{\theta}) &= (8 - 4.06551) / \sqrt{30 (0.135517) (1 - 0.135517)} = 2.098711, \\
r_4(\hat{\theta}) &= (1 - 4.90665) / \sqrt{30 (0.163555) (1 - 0.163555)} = -1.928382.
\end{aligned}$$


Note that two of the standardized residuals look large. Finally, we compute 


$$\begin{aligned}
X_0^2 &= (1 - 0.453125)(-3.151875)^2 + (1 - 0.247803)(3.622378)^2 \\
&\quad + (1 - 0.135517)(2.098711)^2 + (1 - 0.163555)(-1.928382)^2 \\
&= 22.221018
\end{aligned}$$

and

$$P(X^2 > 22.221018) = 0.0000$$

when $X^2 \sim \chi^2(2)$. Therefore, we have strong evidence that the Exponential($\theta$) model
is not correct for these data and we would not use this model to make inference about
$\theta$.

Note that we used the MLE of $\theta$ based on the count data and not the original sample!
If instead we were to use the MLE for $\theta$ based on the original sample (in this case, equal
to $1/\bar{x}$ and so much easier to compute), then Theorem 9.1.2 would no longer be valid.
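The computations in this example can be reproduced with a short sketch. Instead of plotting the likelihood on successively shorter intervals, the code below uses a golden-section search, which is a swapped-in numerical technique and not the method used in the text. The search interval [0.01, 5] and all function names are assumptions made for the illustration. With $q = e^{-\theta}$ the likelihood reduces to $(1 - q)^{29} q^{35}$, so the maximizer can also be checked against $\hat{\theta} = \ln(64/35)$.

```python
import math

counts = [5, 16, 8, 1]          # cells (0,1], (1,2], (2,3], (3,inf)
n = sum(counts)

def cell_probs(theta):
    """Cell probabilities under Exponential(theta), with theta a rate parameter."""
    q = math.exp(-theta)
    return [1 - q, q - q**2, q**2 - q**3, q**3]

def log_lik(theta):
    """Log-likelihood of the grouped (multinomial) data."""
    return sum(x * math.log(p) for x, p in zip(counts, cell_probs(theta)))

def maximize(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) > f(d):          # maximizer lies in [a, d]
            b, d = d, c
            c = b - g * (b - a)
        else:                    # maximizer lies in [c, b]
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2

theta_hat = maximize(log_lik, 0.01, 5.0)
p_hat = cell_probs(theta_hat)
residuals = [(x - n * p) / math.sqrt(n * p * (1 - p)) for x, p in zip(counts, p_hat)]
x2 = sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, p_hat))
p_value = math.exp(-x2 / 2)      # exact chi-squared(2) tail probability
print(f"theta_hat = {theta_hat:.6f}, X^2 = {x2:.4f}, P-value = {p_value:.6f}")
```

The degrees of freedom are k - 1 - dim Ω = 4 - 1 - 1 = 2, which is why the tail probability has the closed form $e^{-x/2}$.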


The chi-squared goodness of fit test is but one of many discrepancy statistics that
have been proposed for model checking in the statistical literature. The general approach
is to select a discrepancy statistic D, like $X^2$, such that the exact or asymptotic
distribution of D is independent of $\theta$ and known. We then compute a P-value based on
D. The Kolmogorov–Smirnov test and the Cramér–von Mises test are further examples
of such discrepancy statistics, but we do not discuss these here.


9.1.3 | Prediction and Cross-Validation 


Perhaps the most rigorous test that a scientific model or theory can be subjected to 
is assessing how well it predicts new data after it has been fit to an independent data 
set. In fact, this is a crucial step in the acceptance of any new empirically developed 
scientific theory — to be accepted, it must predict new results beyond the data that led 
to its formulation. 

If a model does not do a good job at predicting new data, then it is reasonable to say 
that we have evidence against the model being correct. If the model is too simple, then 
the fitted model will underfit the observed data and also the future data. If the model is 
too complicated, then the model will overfit the original data, and this will be detected 
when we consider the new data in light of this fitted model. 

In statistical applications, we typically cannot wait until new data are generated to
check the model. So statisticians use a technique called cross-validation. For this, we
split an original data set $x_1, \ldots, x_n$ into two parts: the training set T, comprising k of
the data values and used to fit the model; and the validation set V, which comprises
the remaining n − k data values. Based on the training data, we construct predictors of
various aspects of the validation data. Using the discrepancies between the predicted
and actual values, we then assess whether or not the validation set V is surprising as a
possible future sample from the model.

Of course, there are $\binom{n}{k}$ possible such splits of the data and we would not want to
make a decision based on just one of these. So a cross-validational analysis will have
to take this into account. Furthermore, we will have to decide how to measure the
discrepancies between T and V and choose a value for k. We do not pursue this topic
any further in this text.
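To make the idea concrete, here is a rough sketch under several assumptions that the text does not prescribe: the model is location-scale normal, the discrepancy is the standardized gap between the validation mean and the training mean, and a random subset of splits is examined because the full set of $\binom{n}{k}$ splits is too large to enumerate. The data, the seed, and the function name are hypothetical.

```python
import math
import random
from statistics import mean, stdev

def split_discrepancy(data, k, rng):
    """Fit mean and sd on a random training set of size k; return the
    standardized gap between the validation mean and the training mean."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    train = [data[i] for i in idx[:k]]
    valid = [data[i] for i in idx[k:]]
    m, s = mean(train), stdev(train)
    return (mean(valid) - m) / (s * math.sqrt(1 / k + 1 / len(valid)))

rng = random.Random(0)                             # arbitrary seed
data = [rng.gauss(10, 2) for _ in range(40)]       # hypothetical data set, n = 40
k = 30

n_splits = math.comb(len(data), k)                 # number of possible splits
discrepancies = [split_discrepancy(data, k, rng) for _ in range(200)]
frac_large = sum(abs(d) > 2 for d in discrepancies) / len(discrepancies)
print(n_splits, round(frac_large, 3))
```

Because every split reuses the same data, the 200 discrepancies are dependent, so the fraction of large values is only an informal summary and not a P-value.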


9.1.4 | What Do We Do When a Model Fails?


So far we have been concerned with determining whether or not an assumed model is 
appropriate given observed data. Suppose the result of our model checking is that we 
decide a particular model is inappropriate. What do we do now? 

Perhaps the obvious response is to say that we have to come up with a more appro- 
priate model — one that will pass our model checking. It is not obvious how we should 
go about this, but statisticians have devised some techniques. 

One of the simplest techniques is the method of transformations. For example, suppose
that we observe a sample $y_1, \ldots, y_n$ from the distribution given by Y = exp(X)
with $X \sim N(\mu, \sigma^2)$. A normal probability plot based on the $y_i$, as in Figure 9.1.10,
will detect evidence of the nonnormality of the distribution. Transforming these $y_i$
values to $\ln y_i$ will, however, yield a reasonable looking normal probability plot, as in
Figure 9.1.11.

So in this case, a simple transformation of the sample yields a data set that passes
this check. In fact, this is something statisticians commonly do. Several transformations
from the family of power transformations given by $Y^p$ for $p \neq 0$, or the logarithm
transformation ln Y, are tried to see if a distributional assumption can be satisfied by a
transformed sample. We will see some applications of this in Chapter 10. Surprisingly,
this simple technique often works, although there are no guarantees that it always will.

Perhaps the most commonly applied transformation is the logarithm when our data 
values are positive (note that this is a necessity for this transformation). Another very 
common transformation is the square root transformation, i.e., p = 1/2, when we have 
count data. Of course, it is not correct to try many, many transformations until we find 
one that makes the probability plots or residual plots look acceptable. Rather, we try a 
few simple transformations. 
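The effect of the logarithm transformation can be illustrated with a small simulation. As a numerical stand-in for inspecting the plot, the sketch below computes the correlation between the sorted sample and the normal scores $\Phi^{-1}(i/(n+1))$; this summary, the seed, and the function name are assumptions made for the example and are not part of the text.

```python
import math
import random
from statistics import NormalDist, mean

def plot_correlation(x):
    """Correlation between the sorted sample and the normal scores."""
    n = len(x)
    xs = sorted(x)
    std_normal = NormalDist()
    scores = [std_normal.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    mx, ms = mean(xs), mean(scores)
    sxy = sum((a - mx) * (b - ms) for a, b in zip(xs, scores))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - ms) ** 2 for b in scores)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2)                                   # arbitrary seed
y = [math.exp(rng.gauss(0, 1)) for _ in range(50)]       # Y = exp(X), X ~ N(0, 1)

r_raw = plot_correlation(y)                              # plot of the y_i
r_log = plot_correlation([math.log(v) for v in y])       # plot of the ln y_i
print(f"raw: {r_raw:.3f}, log-transformed: {r_log:.3f}")
```

A correlation close to 1 corresponds to a nearly straight normal probability plot, so the log-transformed sample should give the larger value.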


Figure 9.1.10: A normal probability plot of a sample of n = 50 from the distribution given by
Y = exp(X) with X ~ N(0, 1).


Figure 9.1.11: A normal probability plot of a sample of n = 50 from the distribution given by
ln Y, where Y = exp(X) and X ~ N(0, 1).


Summary of Section 9.1 


• Model checking is a key component of the practical application of statistics.

• One approach to model checking involves choosing a discrepancy statistic D and
then assessing whether or not the observed value of D is surprising by computing
a P-value.

• Computation of the P-value requires that the distribution of D be known under
the assumption that the model is correct. There are two approaches to accomplishing
this. One involves choosing D to be ancillary, and the other involves
computing the P-value using the conditional distribution of the data given the
minimal sufficient statistic.

• The chi-squared goodness of fit statistic is a commonly used discrepancy statistic.
For large samples, with the expected cell counts determined by the MLE
based on the multinomial likelihood, the chi-squared goodness of fit statistic is
approximately ancillary.

• There are also many informal methods of model checking based on various plots
of residuals.

• If a model is rejected, then there are several techniques for modifying the model.
These typically involve transformations of the data. Also, a model that fails a
model-checking procedure may still be useful, if the deviation from correctness
is small.


EXERCISES 


9.1.1 Suppose the following sample is assumed to be from an $N(\theta, 4)$ distribution with
$\theta \in R^1$ unknown.


1.8 21) S380 CS eS 1.1 1.0 0.0 3.3 1.0 


—0.4 —0.1 2.3 —1.6 1.1 —1.3 3.3 —4.9 —1.1 1.9 


Check this model using the discrepancy statistic of Example 9.1.1. 
9.1.2 Suppose the following sample is assumed to be from an $N(\theta, 2)$ distribution with
$\theta$ unknown.


-0.4 1.9 -0.3 -0.2 0.0 0.0 -0.1 -1.1 2.0 0.4


(a) Plot the standardized residuals. 
(b) Construct a normal probability plot of the standardized residuals. 
(c) What conclusions do you draw based on the results of parts (a) and (b)? 


9.1.3 Suppose the following sample is assumed to be from an $N(\mu, \sigma^2)$ distribution,
where $\mu \in R^1$ and $\sigma^2 > 0$ are unknown.


14.0 9.4 12.1 13.4 6.3 8.5 7.1 12.4 13.3 9.1


(a) Plot the standardized residuals. 
(b) Construct a normal probability plot of the standardized residuals. 
(c) What conclusions do you draw based on the results of parts (a) and (b)? 


9.1.4 Suppose the following table was obtained from classifying members of a sample
of n = 10 from a student population according to the classification variables A and B,
where A = 0, 1 indicates male, female and B = 0, 1 indicates conservative, liberal.


Check the model that says gender and political orientation are independent, using
Fisher’s exact test.

9.1.5 The following sample of n = 20 is supposed to be from a Uniform[0, 1] distribution.


0.11 0.56 0.72 0.18 0.26 0.32 0.42 0.22 0.96 0.04


0.45 0.22 0.08 0.65 0.32 0.88 0.76 0.32 0.21 0.80


After grouping the data, using a partition of five equal-length intervals, carry out the 
chi-squared goodness of fit test to assess whether or not we have evidence against this 
assumption. Record the standardized residuals. 

9.1.6 Suppose a die is tossed 1000 times, and the following frequencies are obtained 
for the number of pips up when the die comes to a rest. 


X1 X2 X3 X4 X5 X6 
163 178 142 150 183 184 


Using the chi-squared goodness of fit test, assess whether we have evidence that this is 
not a symmetrical die. Record the standardized residuals. 

9.1.7 Suppose the sample space for a response is given by S = {0, 1, 2,3,...}. 

(a) Suppose that a statistician believes that in fact the response will lie in the set S = 
{10, 11, 12, 13, ...} and so chooses a probability measure P that reflects this. When 
the data are collected, however, the value s = 3 is observed. What is an appropriate 
P-value to quote as a measure of how surprising this value is as a potential value from 
P? 

(b) Suppose instead P is taken to be a Geometric(0.1) distribution. Determine an ap- 
propriate P-value to quote as a measure of how surprising s = 3 is as a potential value 
from P. 


9.1.8 Suppose we observe s = 3 heads in n = 10 independent tosses of a purportedly
fair coin. Compute a P-value that assesses how surprising this value is as a potential 
value from the distribution prescribed. Do not use the chi-squared test. 

9.1.9 Suppose you check a model by computing a P-value based on some discrepancy 
statistic and conclude that there is no evidence against the model. Does this mean the 
model is correct? Explain your answer. 

9.1.10 Suppose you are told that standardized scores on a test are distributed N (0, 1). 
A student’s standardized score is —4. Compute an appropriate P-value to assess whether 
or not the statement is correct. 

9.1.11 It is asserted that a coin is being tossed in independent tosses. You are somewhat
skeptical about the independence of the tosses because you know that some people
practice tossing coins so that they can increase the frequency of getting a head. So you
wish to assess the independence of $(x_1, \ldots, x_n)$ from a Bernoulli($\theta$) distribution.

(a) Show that the conditional distribution of $(x_1, \ldots, x_n)$ given $\bar{x}$ is uniform on the set
of all sequences of length n with entries from {0, 1} and exactly $n\bar{x}$ entries equal to 1.

(b) Using this conditional distribution, determine the distribution of the number of 1’s
observed in the first $\lfloor n/2 \rfloor$ observations. (Hint: The hypergeometric distribution.)

(c) Suppose you observe (1, 1, 1, 1, 1, 0, 0, 0, 0, 1). Compute an appropriate P-value to
assess the independence of these tosses using (b).


COMPUTER EXERCISES 


9.1.12 For the data of Exercise 9.1.1, present a normal probability plot of the standard- 
ized residuals and comment on it. 

9.1.13 Generate 25 samples from the N(0, 1) distribution with n = 10 and look at 
their normal probability plots. Draw any general conclusions. 

9.1.14 Suppose the following table was obtained from classifying members of a sample
of n = 100 from a student population according to the classification variables A
and B, where A = 0, 1 indicates male, female and B = 0, 1 indicates conservative,
liberal.


Check the model that gender and political orientation are independent using Fisher’s 
exact test. 

9.1.15 Using software, generate a sample of n = 1000 from the Binomial(10, 0.2)
distribution. Then, using the chi-squared goodness of fit test, check that this sample is
indeed from this distribution. Use grouping to ensure $E(X_i) = np_i > 1$. What would
you conclude if you got a P-value close to 0?

9.1.16 Using a statistical package, generate a sample of n = 1000 from the Poisson(5)
distribution. Then, using the chi-squared goodness of fit test based on grouping the
observations into five cells chosen to ensure $E(X_i) = np_i > 1$, check that this sample
is indeed from this distribution. What would you conclude if you got a P-value close
to 0?

9.1.17 Using a statistical package, generate a sample of n = 1000 from the N(0, 1)
distribution. Then, using the chi-squared goodness of fit test based on grouping the
observations into five cells chosen to ensure $E(X_i) = np_i > 1$, check that this sample
is indeed from this distribution. What would you conclude if you got a P-value close
to 0?


PROBLEMS 


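9.1.18 (Multivariate normal distribution) A random vector $Y = (Y_1, \ldots, Y_k)$ is said to
have a multivariate normal distribution with mean vector $\mu \in R^k$ and variance matrix
$\Sigma = (\sigma_{ij}) \in R^{k \times k}$ if

$$a_1 Y_1 + \cdots + a_k Y_k \sim N\left( \sum_{i=1}^{k} a_i \mu_i, \; \sum_{i=1}^{k} \sum_{j=1}^{k} a_i a_j \sigma_{ij} \right)$$

for every choice of $a_1, \ldots, a_k \in R^1$. We write $Y \sim N_k(\mu, \Sigma)$. Prove that $E(Y_i) = \mu_i$,
$\text{Cov}(Y_i, Y_j) = \sigma_{ij}$ and that $Y_i \sim N(\mu_i, \sigma_{ii})$. (Hint: Theorem 3.3.4.)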




9.1.19 In Example 9.1.1, prove that the residual $R = (R_1, \ldots, R_n)$ is distributed multivariate
normal (see Problem 9.1.18) with mean vector $\mu = (0, \ldots, 0)$ and variance
matrix $\Sigma = (\sigma_{ij}) \in R^{n \times n}$, where $\sigma_{ij} = -\sigma_0^2/n$ when $i \neq j$ and $\sigma_{ii} = \sigma_0^2(1 - 1/n)$.
(Hint: Theorem 4.6.1.)

9.1.20 If $Y = (Y_1, \ldots, Y_k)$ is distributed multivariate normal with mean vector $\mu \in R^k$
and variance matrix $\Sigma = (\sigma_{ij}) \in R^{k \times k}$ and if $X = (X_1, \ldots, X_l)$ is distributed multivariate
normal with mean vector $\nu \in R^l$ and variance matrix $\Upsilon = (\tau_{ij}) \in R^{l \times l}$, then it
can be shown that Y and X are independent whenever $\sum_{i=1}^{k} a_i Y_i$ and $\sum_{i=1}^{l} b_i X_i$ are
independent for every choice of $(a_1, \ldots, a_k)$ and $(b_1, \ldots, b_l)$. Use this fact to show
that, in Example 9.1.1, $\bar{X}$ and R are independent. (Hint: Theorem 4.6.2 and Problem
9.1.19.)

9.1.21 In Example 9.1.4, prove that $(\hat{\alpha}_1, \hat{\beta}_1) = (x_{1\cdot}/n, x_{\cdot 1}/n)$ is the MLE.


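9.1.22 In Example 9.1.4, prove that the number of samples satisfying the constraints
(9.1.2) equals

$$\binom{n}{x_{1\cdot}} \binom{n}{x_{\cdot 1}}.$$

(Hint: Using i for the count $x_{11}$, show that the number of such samples equals

$$\binom{n}{x_{\cdot 1}} \sum_{i=\max\{0,\, x_{1\cdot}+x_{\cdot 1}-n\}}^{\min\{x_{1\cdot},\, x_{\cdot 1}\}} \binom{x_{\cdot 1}}{i} \binom{n - x_{\cdot 1}}{x_{1\cdot} - i}$$

and sum this using the fact that the sum of Hypergeometric($n, x_{\cdot 1}, x_{1\cdot}$) probabilities
equals 1.)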


COMPUTER PROBLEMS 


9.1.23 For the data of Exercise 9.1.3, carry out a simulation to estimate the P-value for 
the discrepancy statistic of Example 9.1.2. Plot a density histogram of the simulated 
values. (Hint: See Appendix B for appropriate code.) 

9.1.24 When n = 10, generate $10^4$ values of the discrepancy statistic in Example 9.1.2
when we have a sample from an N(0, 1) distribution. Plot these in a density histogram.
Repeat this, but now generate from a Cauchy distribution. Compare the histograms (do
not forget to make sure both plots have the same scales).

9.1.25 The following data are supposed to have come from an Exponential($\theta$) distribution,
where $\theta > 0$ is unknown.


1.5 1.6 1.4 9.7 12.1 2.7 2.2 1.6 6.8 0.1


0.8 1.7 8.0 0.2 12.3 2.2 0.2 0.6 10.1 4.9


Check this model using a chi-squared goodness of fit test based on the intervals
$(-\infty, 2.0]$, (2.0, 4.0], (4.0, 6.0], (6.0, 8.0], (8.0, 10.0], $(10.0, \infty)$.


(Hint: Calculate the MLE by plotting the log-likelihood over successively smaller in- 
tervals.) 




9.1.26 The following table, taken from Introduction to the Practice of Statistics, by D.
Moore and G. McCabe (W. H. Freeman, New York, 1999), gives the measurements in 
milligrams of daily calcium intake for 38 women between the ages of 18 and 24 years. 


1062 970 909 802 374 416 784 997 
438 1420 1425 948 1050 976 572 403 


1253 549 1325 446 465 1269 671 696 
1933 748 1203 2433 1255 110 


(a) Suppose that the model specifies a location normal model for these data with $\sigma^2 = (500)^2$.
Carry out a chi-squared goodness of fit test on these data using the intervals
$(-\infty, 600]$, (600, 1200], (1200, 1800], $(1800, \infty)$. (Hint: Plot the log-likelihood over
successively smaller intervals to determine the MLE to about one decimal place. To
determine the initial range for plotting, use the overall MLE of $\mu$ minus three standard
errors to the overall MLE plus three standard errors.)

(b) Compare the MLE of $\mu$ obtained in part (a) with the ungrouped MLE.

(c) It would be more realistic to assume that the variance $\sigma^2$ is unknown as well. Record
the log-likelihood for the grouped data. (More sophisticated numerical methods are
needed to find the MLE of $(\mu, \sigma^2)$ in this case.)


9.1.27 Generate $10^4$ values of the discrepancy statistics $D_{\text{skew}}$ and $D_{\text{kurtosis}}$ in Example
9.1.2 when we have a sample of n = 10 from an N(0, 1) distribution. Plot these
in density histograms. Indicate how you would use these histograms to assess the
normality assumption when we had an actual sample of size 10. Repeat this for n = 20
and compare the distributions.


CHALLENGES 


9.1.28 (MV) Prove that when $(x_1, \ldots, x_n)$ is a sample from the distribution given by
$\mu + \sigma Z$, where Z has a known distribution and $(\mu, \sigma^2) \in R^1 \times (0, \infty)$ is unknown,
then the statistic

$$r(x_1, \ldots, x_n) = \left( \frac{x_1 - \bar{x}}{s}, \ldots, \frac{x_n - \bar{x}}{s} \right)$$

is ancillary. (Hint: Write a sample element as $x_i = \mu + \sigma z_i$ and then show that
$r(x_1, \ldots, x_n)$ can be written as a function of the $z_i$.)


9.2 | Checking for Prior—Data Conflict 


Bayesian methodology adds the prior probability measure $\Pi$ to the statistical model
$\{P_\theta : \theta \in \Omega\}$, for the subsequent statistical analysis. The methods of Section 9.1 are
designed to check that the observed data can realistically be assumed to have come
from a distribution in $\{P_\theta : \theta \in \Omega\}$. When we add the prior, we are in effect saying
that our knowledge about the true distribution leads us to assign the prior predictive
probability M, given by $M(A) = E_\Pi(P_\theta(A))$ for $A \subset S$, to describe the process
generating the data. So it would seem, then, that a sensible Bayesian model-checking
approach would be to compare the observed data s with the distribution given by M,
to see if it is surprising or not.

Suppose that we were to conclude that the Bayesian model was incorrect after
deciding that s is a surprising value from M. This only tells us, however, that the
probability measure M is unlikely to have produced the data and not that the model
$\{P_\theta : \theta \in \Omega\}$ was wrong. Consider the following example.


EXAMPLE 9.2.1 Prior—Data Conflict 

Suppose we obtain a sample consisting of n = 20 values of s = 1 from the model with
$\Omega = \{1, 2\}$ and probability functions for the basic response given by the following
table.


Then the probability of obtaining this sample from $f_2$ is given by $(0.9)^{20} = 0.12158$,
which is a reasonable value, so we have no evidence against the model $\{f_1, f_2\}$.

Suppose we place a prior on $\Omega$ given by $\Pi(\{1\}) = 0.9999$, so that we are virtually
certain that $\theta = 1$. Then the probability of getting these data from the prior predictive
M is

$$(0.9999)(0.1)^{20} + (0.0001)(0.9)^{20} = 1.2158 \times 10^{-5}.$$

The prior probability of observing a sample of 20, whose prior predictive probability is
no greater than $1.2158 \times 10^{-5}$, can be calculated (using statistical software to tabulate
the prior predictive) to be approximately 0.04. This tells us that the observed data are
“in the tails” of the prior predictive and thus are surprising, which leads us to conclude
that we have evidence that M is incorrect.

So in this example, checking the model $\{f_\theta : \theta \in \Omega\}$ leads us to conclude that it is
plausible for the data observed. On the other hand, checking the model given by M
leads us to the conclusion that the Bayesian model is implausible.
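The tail probability quoted in this example can be checked with a short tabulation. The sketch assumes that the basic response takes only the values 1 and 2, with $f_1(1) = 0.1$, $f_1(2) = 0.9$, $f_2(1) = 0.9$, and $f_2(2) = 0.1$. The values $f_1(1)$ and $f_2(1)$ are implied by the computations in the example, while the two-valued response is an assumption made for this illustration.

```python
import math

n = 20
prior = {1: 0.9999, 2: 0.0001}
# Assumed probability functions: f[theta][s] for responses s in {1, 2}.
f = {1: {1: 0.1, 2: 0.9}, 2: {1: 0.9, 2: 0.1}}

def m_sequence(j):
    """Prior predictive probability of one particular sample containing j ones."""
    return sum(prior[t] * f[t][1] ** j * f[t][2] ** (n - j) for t in prior)

m_obs = m_sequence(n)            # observed sample: all 20 values equal to 1

# Total prior predictive probability of samples no more probable than the observed one.
p_value = sum(
    math.comb(n, j) * m_sequence(j)
    for j in range(n + 1)
    if m_sequence(j) <= m_obs
)
print(f"m(observed) = {m_obs:.4e}, tail probability = {p_value:.4f}")
```

Under these assumptions the samples that are no more probable than the observed one are exactly those with at least five 1's, and the resulting tail probability rounds to 0.04.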


The lesson of Example 9.2.1 is that we can have model failure in the Bayesian context
in two ways. First, the data s may be surprising in light of the model $\{f_\theta : \theta \in \Omega\}$.
Second, even when the data are plausibly from this model, the prior and the data may
conflict. This conflict will occur whenever the prior assigns most of its probability to
distributions in the model for which the data are surprising. In either situation, inferences
drawn from the Bayesian model may be flawed.

If, however, the prior assigns positive probability (or density) to every possible
value of $\theta$, then the consistency results for Bayesian inference mentioned in Chapter 7
indicate that a large amount of data will overcome a prior–data conflict (see Example
9.2.4). This is because the effect of the prior decreases with increasing amounts of data.
So the existence of a prior–data conflict does not necessarily mean that our inferences
are in error. Still, it is useful to know whether or not this conflict exists, as it is often
difficult to detect whether or not we have sufficient data to avoid the problem.

Therefore, we should first use the checks discussed in Section 9.1 to ensure that the
data s is plausibly from the model $\{f_\theta : \theta \in \Omega\}$. If we accept the model, then we look
for any prior–data conflict. We now consider how to go about this.




The prior predictive distribution of any ancillary statistic is the same as its distribution
under the sampling model, i.e., its prior predictive distribution is not affected by
the choice of the prior. So the observed value of any ancillary statistic cannot tell us
anything about the existence of a prior–data conflict. We conclude from this that, if we
are going to use some function of the data to assess whether or not there is prior–data
conflict, then its marginal distribution has to depend on $\theta$.

We now show that the prior predictive conditional distribution of the data given a
minimal sufficient statistic T is independent of the prior.


Theorem 9.2.1 Suppose T is a sufficient statistic for the model $\{f_\theta : \theta \in \Omega\}$ for
data s. Then the conditional prior predictive distribution of the data s given T is
independent of the prior $\pi$.


PROOF We will prove this in the case that each sample distribution $f_\theta$ and the prior
$\pi$ are discrete. A similar argument can be developed for the more general case.
By Theorem 6.1.1 (factorization theorem) we have that

$$f_\theta(s) = h(s)\, g_\theta(T(s))$$

for some functions $g_\theta$ and $h$. Therefore the prior predictive probability function of s is
given by

$$m(s) = h(s) \sum_{\theta \in \Omega} g_\theta(T(s))\, \pi(\theta).$$

The prior predictive probability function of T at t is given by

$$m^*(t) = \sum_{\{s : T(s) = t\}} h(s) \sum_{\theta \in \Omega} g_\theta(t)\, \pi(\theta).$$

Therefore, the conditional prior predictive probability function of the data s given
$T(s) = t$ is

$$m(s \mid T = t) = \frac{h(s) \sum_{\theta \in \Omega} g_\theta(t)\, \pi(\theta)}{\sum_{\{s' : T(s') = t\}} h(s') \sum_{\theta \in \Omega} g_\theta(t)\, \pi(\theta)} = \frac{h(s)}{\sum_{\{s' : T(s') = t\}} h(s')},$$

which is independent of $\pi$. ■
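The conclusion of Theorem 9.2.1 can be checked numerically in a small discrete case. The following is only a sketch under assumptions that are not in the text: a Bernoulli sample of size 3, a three-point parameter space, and two arbitrary priors (`thetas`, `prior_a` and `prior_b` are illustrative choices). For this model $h(s) = 1$, so the conditional distribution given the count should be uniform over the sequences with that count, whatever the prior.

```python
from fractions import Fraction
from itertools import product


def conditional_given_t(n, thetas, prior):
    """Conditional prior predictive of a Bernoulli sample s given T(s) = sum(s).

    thetas: finite list of success probabilities (an assumed parameter space)
    prior:  matching prior probabilities pi(theta)
    Returns a nested dict {t: {s: m(s | T = t)}}.
    """
    # Prior predictive m(s) = sum over theta of pi(theta) * f_theta(s).
    m = {}
    for s in product((0, 1), repeat=n):
        t = sum(s)
        m[s] = sum(p * th**t * (1 - th) ** (n - t) for th, p in zip(thetas, prior))

    # Prior predictive of T: m*(t) = sum of m(s) over {s : T(s) = t}.
    m_star = {}
    for s, prob in m.items():
        m_star[sum(s)] = m_star.get(sum(s), 0) + prob

    # Condition on the observed value of T.
    cond = {}
    for s, prob in m.items():
        cond.setdefault(sum(s), {})[s] = prob / m_star[sum(s)]
    return cond


# Hypothetical parameter space and two very different priors on it.
thetas = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]
prior_a = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
prior_b = [Fraction(98, 100), Fraction(1, 100), Fraction(1, 100)]

cond_a = conditional_given_t(3, thetas, prior_a)
cond_b = conditional_given_t(3, thetas, prior_b)
print(cond_a == cond_b)  # the two conditional distributions coincide
```

Exact rational arithmetic is used so that the comparison between the two priors is an equality check rather than a tolerance check.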


So, from Theorem 9.2.1, we conclude that any aspects of the data, beyond the value 
of a minimal sufficient statistic, can tell us nothing about the existence of a prior- 
data conflict. Therefore, if we want to base our check for a prior—data conflict on the 
prior predictive, then we must use the prior predictive for a minimal sufficient statistic. 
Consider the following examples. 


EXAMPLE 9.2.2 Checking a Beta Prior for a Bernoulli Model 

Suppose that $(x_1, \ldots, x_n)$ is a sample from a Bernoulli($\theta$) model, where $\theta \in [0, 1]$ is
unknown, and $\theta$ is given a Beta($\alpha$, $\beta$) prior distribution. Then we have that the sample
count $y = \sum_{i=1}^{n} x_i$ is a minimal sufficient statistic and is distributed Binomial($n$, $\theta$).


Chapter 9: Model Checking 505 


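Therefore, the prior predictive probability function for y is given by

$$
\begin{aligned}
m(y) &= \binom{n}{y} \int_0^1 \theta^{y} (1-\theta)^{n-y}\, \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1} (1-\theta)^{\beta-1}\, d\theta \\
&= \frac{\Gamma(n+1)}{\Gamma(y+1)\Gamma(n-y+1)}\, \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(y+\alpha)\Gamma(n-y+\beta)}{\Gamma(n+\alpha+\beta)} \\
&\propto \frac{\Gamma(y+\alpha)\Gamma(n-y+\beta)}{\Gamma(y+1)\Gamma(n-y+1)}.
\end{aligned}
$$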


Now observe that when $\alpha = \beta = 1$, then $m(y) = 1/(n + 1)$, i.e., the prior predictive
of y is Uniform{0, 1, ..., n}, and no values of y are surprising. This is not unexpected,
as with the uniform prior on $\theta$, we are implicitly saying that any count y is reasonable.

On the other hand, when $\alpha = \beta = 2$, the prior puts more weight around 1/2. The
prior predictive is then proportional to $(y + 1)(n - y + 1)$. This prior predictive is
plotted in Figure 9.2.1 when n = 20. Note that counts near 0 or 20 lead to evidence 
that there is a conflict between the data and the prior. For example, if we obtain the
count $y = 2$, we can assess how surprising this value is by computing the probability
of obtaining a value with a lower probability of occurrence. Using the symmetry of the
prior predictive, we have that this probability equals (using statistical software for the
computation) $m(0) + m(1) + m(19) + m(20) = 0.0688876$. Therefore, the observation
$y = 2$ is not surprising at the 5% level.


(Plot not reproduced. Horizontal axis: y, from 0 to 20; vertical axis: prior predictive probability.)

Figure 9.2.1: Plot of the prior predictive of the sample count y in Example 9.2.2 when
$\alpha = \beta = 2$ and $n = 20$.
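The prior predictive of the count and the associated tail computation can be reproduced with a few lines of code. This is a minimal sketch rather than the software the authors used: the function names are hypothetical, and the P-value is taken to be the total prior predictive probability of counts that are strictly less probable than the observed one, which is an assumption about how ties are handled.

```python
from math import comb, exp, lgamma


def prior_predictive(y, n, a, b):
    """m(y) for a Binomial(n, theta) count when theta has a Beta(a, b) prior."""

    def log_beta(p, q):
        return lgamma(p) + lgamma(q) - lgamma(p + q)

    return comb(n, y) * exp(log_beta(y + a, n - y + b) - log_beta(a, b))


def conflict_p_value(y_obs, n, a, b):
    """Prior predictive probability of counts strictly less probable than y_obs."""
    m = [prior_predictive(y, n, a, b) for y in range(n + 1)]
    # A small relative tolerance keeps counts that tie with y_obs (for example
    # by symmetry) from being counted as "less probable" due to rounding.
    return sum(p for p in m if p < m[y_obs] * (1 - 1e-9))


# With n = 20 and alpha = beta = 2, the counts less probable than y = 2 are
# 0, 1, 19 and 20, which reproduces the value 0.0688876 quoted in the example.
print(round(conflict_p_value(2, 20, 2, 2), 7))  # 0.0688876
```

With $\alpha = \beta = 1$ every count has probability $1/(n+1)$, so no count is strictly less probable than any other and the function returns 0 for every observed value.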


Suppose now that $n = 50$ and $\alpha = 2$, $\beta = 4$. The mean of this prior is $2/(2 + 4) = 1/3$
and the prior is right-skewed. The prior predictive is plotted in Figure 9.2.2.
Clearly, values of y near 50 give evidence against the model in this case. For example,
if we observe $y = 35$, then the probability of getting a count with smaller probability of
occurrence is given by (using statistical software for the computation)
$m(36) + \cdots + m(50) = 0.0500457$. Only values more extreme than this would provide evidence
against the model at the 5% level.


(Plot not reproduced. Horizontal axis: y, from 0 to 50; vertical axis: prior predictive probability.)

Figure 9.2.2: Plot of the prior predictive of the sample count y in Example 9.2.2 when
$\alpha = 2$, $\beta = 4$ and $n = 50$.


■
EXAMPLE 9.2.3 Checking a Normal Prior for a Location Normal Model
Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu \in R^1$
is unknown and $\sigma_0^2$ is known. Suppose we take the prior distribution of $\mu$ to be an
$N(\mu_0, \tau_0^2)$ for some specified choice of $\mu_0$ and $\tau_0^2$. Note that $\bar{x}$ is a minimal sufficient
statistic for this model, so we need to compare the observed value of this statistic to its prior
predictive distribution to assess whether or not there is prior-data conflict.

Now we can write $\bar{x} = \mu + z$, where $\mu \sim N(\mu_0, \tau_0^2)$ independent of
$z \sim N(0, \sigma_0^2/n)$. From this, we immediately deduce (see Exercise 9.2.3) that the prior
predictive distribution of $\bar{x}$ is $N(\mu_0, \tau_0^2 + \sigma_0^2/n)$. From the symmetry of the prior predictive
density about $\mu_0$, we immediately see that the appropriate P-value is

$$M(|\bar{X} - \mu_0| \ge |\bar{x} - \mu_0|) = 2\left(1 - \Phi\left(|\bar{x} - \mu_0| / (\tau_0^2 + \sigma_0^2/n)^{1/2}\right)\right). \qquad (9.2.1)$$

So a small value of (9.2.1) is evidence that there is a conflict between the observed data
and the prior, i.e., the prior is putting most of its mass on values of $\mu$ for which the
observed data are surprising. ■
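The P-value (9.2.1) is straightforward to evaluate once the standard normal cdf is available. The sketch below uses only the Python standard library; the function and argument names are hypothetical, and the numerical inputs in the usage lines are made-up illustrations, not values from the text.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def location_normal_conflict_p_value(xbar, n, sigma0_sq, mu0, tau0_sq):
    """Evaluate (9.2.1): 2 * (1 - Phi(|xbar - mu0| / sqrt(tau0^2 + sigma0^2 / n)))."""
    sd = sqrt(tau0_sq + sigma0_sq / n)  # prior predictive standard deviation of xbar
    return 2.0 * (1.0 - normal_cdf(abs(xbar - mu0) / sd))


# Illustrative inputs (assumed): the prior predictive variance is 0.5 + 5/10 = 1,
# so the P-value depends only on |xbar - mu0|.
print(location_normal_conflict_p_value(0.3, 10, 5.0, 0.0, 0.5))
print(location_normal_conflict_p_value(3.0, 10, 5.0, 0.0, 0.5))
```

An observed mean far from $\mu_0$, measured on the scale $(\tau_0^2 + \sigma_0^2/n)^{1/2}$, gives a small value and hence evidence of prior-data conflict.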


Another possibility for model checking in this context is to look at the posterior 
predictive distribution of the data. Consider, however, the following example. 


EXAMPLE 9.2.4 (Example 9.2.1 continued)
Recall that, in Example 9.2.1, we concluded that a prior-data conflict existed. Note,
however, that the posterior probability of $\theta = 2$ is

$$\frac{(0.0001)(0.9)^{20}}{(0.9999)(0.1)^{20} + (0.0001)(0.9)^{20}} \approx 1.$$

Therefore, the posterior predictive probability of the observed sequence of 20 values of
1 is 0.12158, which does not indicate any prior-data conflict. We note, however, that
in this example, the amount of data is sufficient to overwhelm the prior; thus we are
led to a sensible inference about $\theta$. ■
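The arithmetic in this example is easy to reproduce. The sketch below reads the prior probabilities and the two likelihood values directly from the displayed ratio; it assumes that the two sampling distributions assign probabilities $(0.1)^{20}$ and $(0.9)^{20}$ to the observed sequence, and the variable names are illustrative.

```python
# Prior probabilities and likelihoods of the observed sequence, as they appear
# in the displayed ratio (theta takes the values 1 and 2).
prior = {1: 0.9999, 2: 0.0001}
likelihood = {1: 0.1**20, 2: 0.9**20}

# Prior predictive probability of the observed data.
m_obs = sum(prior[t] * likelihood[t] for t in prior)

# Posterior probabilities by Bayes' theorem.
posterior = {t: prior[t] * likelihood[t] / m_obs for t in prior}

# Posterior predictive probability of seeing the same sequence again.
post_pred = sum(posterior[t] * likelihood[t] for t in prior)

print(f"prior predictive of the data:     {m_obs:.3e}")
print(f"posterior probability of theta=2: {posterior[2]:.6f}")
print(f"posterior predictive of the data: {post_pred:.5f}")  # 0.12158
```

The prior predictive probability of the data is tiny while the posterior predictive probability is not, which is the contrast the example draws: the posterior has already absorbed the data, so it cannot flag a conflict with the prior.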




The problem with using the posterior predictive to assess whether or not a prior— 
data conflict exists is that we have an instance of the so-called double use of the data. 
For we have fit the model, i.e., constructed the posterior predictive, using the observed 
data, and then we tried to use this posterior predictive to assess whether or not a prior— 
data conflict exists. The double use of the data results in overly optimistic assessments 
of the validity of the Bayesian model and will often not detect discrepancies. We will 
not discuss posterior model checking further in this text. 

We have only touched on the basics of checking for prior—data conflict here. With 
more complicated models, the possibility exists of checking individual components of a
prior, e.g., the components of the prior specified in Example 7.1.4 for the location-scale 
normal model, to ascertain more precisely where a prior—data conflict is arising. Also, 
ancillary statistics play a role in checking for prior—data conflict as we must remove any 
ancillary variation when computing the P-value because this variation does not depend 
on the prior. Furthermore, when the prior predictive distribution of a minimal sufficient 
statistic is continuous, then issues concerning exactly how P-values are to be computed 
must be addressed. These are all topics for a further course in statistics. 


Summary of Section 9.2 


• In Bayesian inference, there are two potential sources of model incorrectness.
First, the sampling model for the data may be incorrect. Second, even if the
sampling model is correct, the prior may conflict with the data in the sense that
most of the prior probability is assigned to distributions in the model for which
the data are surprising.

• We first check for the correctness of the sampling model using the methods of
Section 9.1. If we do not find evidence against the sampling model, we next
check for prior-data conflict by seeing if the observed value of a minimal sufficient
statistic is surprising or not, with respect to the prior predictive distribution
of this quantity.

• Even if a prior-data conflict exists, posterior inferences may still be valid if we
have enough data.


EXERCISES 


9.2.1 Suppose we observe the value s = 2 from the model, given by the following 
table. 


             s = 1    s = 2    s = 3
$f_1(s)$      1/3      1/3      1/3
$f_2(s)$      1/3       0       2/3

(a) Do the observed data lead us to doubt the validity of the model? Explain why or 
why not. 

(b) Suppose the prior, given by $\pi(1) = 0.3$, is placed on the parameter $\theta \in \{1, 2\}$.
Is there any evidence of a prior—data conflict? (Hint: Compute the prior predictive for 
each possible data set and assess whether or not the observed data set is surprising.) 


(c) Repeat part (b) using the prior given by $\pi(1) = 0.01$.

9.2.2 Suppose a sample of $n = 6$ is taken from a Bernoulli($\theta$) distribution, where $\theta$
has a Beta(3, 3) prior distribution. If the value $n\bar{x} = 2$ is obtained, then determine
whether there is any prior-data conflict.

9.2.3 In Example 9.2.3, establish that the prior predictive distribution of $\bar{x}$ is given by
the $N(\mu_0, \tau_0^2 + \sigma_0^2/n)$ distribution.

9.2.4 Suppose we have a sample of $n = 5$ from an $N(\mu, 2)$ distribution where $\mu$ is
unknown and the value $\bar{x} = 7.3$ is observed. An $N(0, 1)$ prior is placed on $\mu$. Compute
the appropriate P-value to check for prior-data conflict.

9.2.5 Suppose that $x \sim$ Uniform[0, $\theta$] and $\theta \sim$ Uniform[0, 1]. If the value $x = 2.2$ is
observed, then determine an appropriate P-value for checking for prior-data conflict.


COMPUTER EXERCISES 


9.2.6 Suppose a sample of $n = 20$ is taken from a Bernoulli($\theta$) distribution, where
$\theta$ has a Beta(3, 3) prior distribution. If the value $n\bar{x} = 6$ is obtained, then determine
whether there is any prior-data conflict.


PROBLEMS 


9.2.7 Suppose that $(x_1, \ldots, x_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where
$\mu \sim N(\mu_0, \tau_0^2)$. Determine the prior predictive distribution of $\bar{x}$.


9.2.8 Suppose that $(x_1, \ldots, x_n)$ is a sample from an Exponential($\theta$) distribution where
$\theta \sim$ Gamma($\alpha_0$, $\beta_0$). Determine the prior predictive distribution of $\bar{x}$.


9.2.9 Suppose that $(s_1, \ldots, s_n)$ is a sample from a Multinomial($1, \theta_1, \ldots, \theta_k$) distribution,
where $(\theta_1, \ldots, \theta_{k-1}) \sim$ Dirichlet($\alpha_1, \ldots, \alpha_k$). Determine the prior predictive
distribution of $(x_1, \ldots, x_k)$, where $x_i$ is the count in the ith category.

9.2.10 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Uniform[0, $\theta$] distribution, where
$\theta$ has prior density given by $\pi_{\alpha,\beta}(\theta) = (\alpha - 1)\beta^{\alpha-1}\,\theta^{-\alpha} I_{[\beta,\infty)}(\theta)$, where $\alpha > 1$, $\beta > 0$.
Determine the prior predictive distribution of $x_{(n)}$.

9.2.11 Suppose we have the context of Example 9.2.3. Determine the limiting P-value
for checking for prior-data conflict as $n \to \infty$. Interpret the meaning of this P-value
in terms of the prior and the true value of $\mu$.

9.2.12 Suppose that $x \sim$ Geometric($\theta$) and $\theta \sim$ Uniform[0, 1].

(a) Determine the appropriate P-value for checking for prior—data conflict. 

(b) Based on the P-value determined in part (a), describe the circumstances under which 
evidence of prior—data conflict will exist. 

(c) If we use a continuous prior that is positive at a point, then this is an assertion that
the point is possible. In light of this, discuss whether or not a continuous prior that is
positive at $\theta = 0$ makes sense for the Geometric($\theta$) distribution.


CHALLENGES 


9.2.13 Suppose that $X_1, \ldots, X_n$ is a sample from an $N(\mu, \sigma^2)$ distribution where
$\mu \mid \sigma^2 \sim N(\mu_0, \tau_0^2 \sigma^2)$ and $1/\sigma^2 \sim$ Gamma($\alpha_0$, $\beta_0$). Then determine a form for the
prior predictive density of $(\bar{X}, S^2)$ that you could evaluate without integrating. (Hint:
Use the algebraic manipulations found in Section 7.5.)


9.3 The Problem with Multiple Checks


As we have mentioned throughout this text, model checking is a part of good statistical 
practice. In other words, one should always be wary of the value of statistical work 
in which the investigators have not engaged in, and reported the results of, reasonably 
rigorous model checking. It is really the job of those who report statistical results to 
convince us that their models are reasonable for the data collected, bearing in mind the 
effects of both underfitting and overfitting. 

In this chapter, we have reported some of the possible model-checking approaches 
available. We have focused on the main categories of procedures and perhaps the 
most often used methods from within these. There are many others. At this point, we 
cannot say that any one approach is the best possible method. Perhaps greater insight 
along these lines will come with further research into the topic, and then a clearer 
recommendation could be made. 

One recommendation that can be made now, however, is that it is not reasonable to 
go about model checking by implementing every possible model-checking procedure 
you can. A simple example illustrates the folly of such an approach. 


EXAMPLE 9.3.1
Suppose that $(x_1, \ldots, x_n)$ is supposed to be a sample from the $N(0, 1)$ distribution.
Suppose we decide to check this model by computing the P-values

$$P_i = P(X_i^2 \ge x_i^2)$$

for $i = 1, \ldots, n$, where $X_i^2 \sim \chi^2(1)$. Furthermore, we will decide that the model is
incorrect if the minimum of these P-values is less than 0.05.

Now consider the repeated sampling behavior of this method when the model is
correct. We have that

$$\min\{P_1, \ldots, P_n\} < 0.05$$

if and only if

$$\max\{x_1^2, \ldots, x_n^2\} > \chi^2_{0.95}(1),$$

and so

$$
\begin{aligned}
P(\min\{P_1, \ldots, P_n\} < 0.05) &= P(\max\{X_1^2, \ldots, X_n^2\} > \chi^2_{0.95}(1)) \\
&= 1 - P(\max\{X_1^2, \ldots, X_n^2\} \le \chi^2_{0.95}(1)) \\
&= 1 - \prod_{i=1}^{n} P(X_i^2 \le \chi^2_{0.95}(1)) = 1 - (0.95)^n \to 1
\end{aligned}
$$


as $n \to \infty$. This tells us that if n is large enough, we will reject the model with virtual
certainty even though it is correct! Note that n does not have to be very large for there
to be an appreciable probability of making an error. For example, when $n = 10$, the
probability of making an error is 0.40; when $n = 20$, the probability of making an error
is 0.64; and when $n = 100$, the probability of making an error is 0.99. ■
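The error probabilities quoted in Example 9.3.1 follow from the formula $1 - (0.95)^n$, and they can also be checked by simulation. The sketch below is one way to do both using only the standard library; the function names, the number of replications, and the seed are arbitrary choices, and the constant 1.959964 is the usual 0.975 quantile of the standard normal, whose square is the 0.95 quantile of the $\chi^2(1)$ distribution.

```python
import random

# 0.95 quantile of chi-squared(1), i.e. the square of the 0.975 normal quantile.
CHI2_95_1DF = 1.959963984540054**2


def false_rejection_probability(n, cutoff=0.05):
    """Probability that at least one of n independent P-values falls below cutoff."""
    return 1.0 - (1.0 - cutoff) ** n


def simulate_false_rejections(n, reps=20000, seed=0):
    """Monte Carlo estimate of the same probability for N(0, 1) samples of size n."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # min P_i < 0.05 exactly when the largest squared observation
        # exceeds the 0.95 quantile of chi-squared(1).
        if max(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) > CHI2_95_1DF:
            rejections += 1
    return rejections / reps


for n in (10, 20, 100):
    print(n, f"{false_rejection_probability(n):.2f}")
```

The simulated frequencies should agree with the exact values up to Monte Carlo error, which is of the order of 0.004 for 20000 replications.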


We can learn an important lesson from Example 9.3.1, for, if we carry out too many 
model-checking procedures, we are almost certain to find something wrong — even if 
the model is correct. The cure for this is that before actually observing the data (so 
that our choices are not determined by the actual data obtained), we decide on a few 
relevant model-checking procedures to be carried out and implement only these. 

The problem we have been discussing here is sometimes referred to as the problem 
of multiple comparisons, which comes up in other situations as well — e.g., see Sec- 
tion 10.4.1, where multiple means are compared via pairwise tests for differences in 
the means. One approach for avoiding the multiple-comparisons problem is to simply 
lower the cutoff for the P-value so that the probability of making a mistake is appro- 
priately small. For example, if we decided in Example 9.3.1 that evidence against the 
model is only warranted when an individual P-value is smaller than 0.0001, then the 
probability of making a mistake is 0.01 when n = 100. A difficulty with this approach 
generally is that our model-checking procedures will not be independent, and it does 
not always seem possible to determine an appropriate cutoff for the individual P-values. 
More advanced methods are needed to deal with this problem. 
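When the n checks can be treated as independent, the relationship between the individual cutoff and the overall error probability can be inverted directly. The following sketch rests entirely on that independence assumption, which, as noted above, usually fails for real model-checking procedures; the function names are hypothetical.

```python
def overall_error(individual_cutoff, n):
    """Probability of at least one false alarm among n independent checks."""
    return 1.0 - (1.0 - individual_cutoff) ** n


def individual_cutoff_for(overall, n):
    """Individual cutoff that gives the requested overall error under independence."""
    return 1.0 - (1.0 - overall) ** (1.0 / n)


# The cutoff 0.0001 with n = 100 gives an overall error of about 0.01.
print(f"{overall_error(0.0001, 100):.4f}")

# Conversely, the cutoff needed for an overall error of 0.01 when n = 100.
print(f"{individual_cutoff_for(0.01, 100):.7f}")
```

The two functions are inverses of one another, so applying one after the other returns the starting value up to rounding.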


Summary of Section 9.3 


• Carrying out too many model checks is not a good idea, as we will invariably
find something that leads us to conclude that the model is incorrect. Rather than 
engaging in a “fishing expedition,” where we just keep on checking the model, 
it is better to choose a few procedures before we see the data, and use these, and 
only these, for the model checking. 


Chapter 10 


Relationships Among 
Variables 


CHAPTER OUTLINE 


Section 1 Related Variables 

Section 2 Categorical Response and Predictors 

Section 3 Quantitative Response and Predictors 

Section 4 Quantitative Response and Categorical Predictors 
Section 5 Categorical Response and Quantitative Predictors 
Section 6 Further Proofs (Advanced) 


In this chapter, we are concerned with perhaps the most important application of sta- 
tistical inference: the problem of analyzing whether or not a relationship exists among 
variables and what form the relationship takes. As a particular instance of this, recall 
the example and discussion in Section 5.1. 

Many of the most important problems in science and society are concerned with re- 
lationships among variables. For example, what is the relationship between the amount 
of carbon dioxide placed into the atmosphere and global temperatures? What is the re- 
lationship between class size and scholastic achievement by students? What is the 
relationship between weight and carbohydrate intake in humans? What is the relation- 
ship between lifelength and the dosage of a certain drug for cancer patients? These are 
all examples of questions whose answers involve relationships among variables. We 
will see that statistics plays a key role in answering such questions. 

In Section 10.1, we provide a precise definition of what it means for variables to 
be related, and we distinguish between two broad categories of relationship, namely, 
association and cause-effect. Also, we discuss some of the key ideas involved in col- 
lecting data when we want to determine whether a cause-effect relationship exists. In 
the remaining sections, we examine the various statistical methodologies that are used 
to analyze data when we are concerned with relationships. 

We emphasize the use of frequentist methodologies in this chapter. We give some 
examples of the Bayesian approach, but there are some complexities involved with the 
distributional problems associated with Bayesian methods that are best avoided at this
stage. Sampling algorithms for the Bayesian approach have been developed, along the
lines of those discussed in Chapter 7 (see also Chapter 11), but their full discussion 
would take us beyond the scope of this text. It is worth noting, however, that Bayesian 
analyses with diffuse priors will often yield results very similar to those obtained via 
the frequentist approach. 

As discussed in Chapter 9, model checking is an important feature of any statistical 
analysis. For the models used in this chapter, a full discussion of the more rigorous P- 
value approach to model checking requires more development than we can accomplish 
in this text. As such, we emphasize the informal approach to model checking, via 
residual and probability plots. This should not be interpreted as a recommendation that 
these are the preferred methods for such models. 


10.1 Related Variables


Consider a population $\Pi$ with two variables $X, Y : \Pi \to R^1$ defined on it. What does
it mean to say that the variables X and Y are related? Perhaps our first inclination is to
say that there must be a formula relating the two variables, such as $Y = a + bX^2$ for
some choice of constants a and b, or $Y = \exp(X)$, etc. But consider a population $\Pi$
of humans and suppose $X(\pi)$ is the weight of $\pi$ in kilograms and $Y(\pi)$ is the height
of individual $\pi \in \Pi$ in centimeters. From our experience, we know that taller people
tend to be heavier, so we believe that there is some kind of relationship between height 
and weight. We know, too, that there cannot be an exact formula that describes this 
relationship, because people with the same weight will often have different heights, 
and people with the same height will often have different weights. 


10.1.1 The Definition of Relationship


If we think of all the people with a given weight x, then there will be a distribution 
of heights for all those individuals $\pi$ that have weight x. We call this distribution the
conditional distribution of Y, given that X = x. 

We can now express what we mean by our intuitive idea that X and Y are related, 
for, as we change the value of the weight that we condition on, we expect the condi- 
tional distribution to change. In particular, as x increases, we expect that the location 
of the conditional distribution will increase, although other features of the distribution 
may change as well. For example, in Figure 10.1.1 we provide a possible plot of two 
approximating densities for the conditional distributions of Y given X = 70 kg and 
the conditional distribution of Y given X = 90 kg. 

We see that the conditional distribution has shifted up when X goes from 70 to 90 
kg but also that the shape of the distribution has changed somewhat as well. So we can 
say that a relationship definitely exists between X and Y, at least in this population. No- 
tice that, as defined so far, X and Y are not random variables, but they become so when 
we randomly select $\pi$ from the population. In that case, the conditional distributions
referred to become the conditional probability distributions of the random variable Y, 
given that we observe X = 70 and X = 90, respectively. 


(Plot not reproduced. Horizontal axis: values from 140 to 200; vertical axis: density, from 0.00 to 0.05.)

Figure 10.1.1: Plot of two approximating densities for the conditional distribution of Y given
X = 70 kg (dashed line) and the conditional distribution of Y given X = 90 kg (solid line).


We will adopt the following definition to precisely specify what we mean when we 
say that variables are related. 


Definition 10.1.1 Variables X and Y are related variables if there is any change in
the conditional distribution of Y, given X = x, as x changes.


We could instead define what it means for variables to be unrelated. We say that 
variables X and Y are unrelated if they are independent. This is equivalent to Definition 
10.1.1, because two variables are independent if and only if the conditional distribution 
of one given the other does not depend on the condition (Exercise 10.1.1). 
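For two categorical variables with a known joint distribution, Definition 10.1.1 can be checked mechanically by computing each conditional distribution of Y given X = x and comparing them. The sketch below assumes the joint probabilities are supplied for every (x, y) pair as exact fractions; the two tables are made-up illustrations, not data from the text.

```python
from fractions import Fraction


def conditionals_of_y_given_x(joint):
    """joint: {(x, y): P(X = x, Y = y)}. Returns {x: {y: P(Y = y | X = x)}}."""
    marginal_x = {}
    for (x, _), p in joint.items():
        marginal_x[x] = marginal_x.get(x, 0) + p
    cond = {}
    for (x, y), p in joint.items():
        if marginal_x[x] > 0:  # conditioning is only defined where P(X = x) > 0
            cond.setdefault(x, {})[y] = p / marginal_x[x]
    return cond


def are_related(joint):
    """True if the conditional distribution of Y given X = x changes with x."""
    conds = list(conditionals_of_y_given_x(joint).values())
    return any(c != conds[0] for c in conds[1:])


# Hypothetical independent pair: the joint is the product of the marginals.
px = {0: Fraction(1, 4), 1: Fraction(3, 4)}
py = {0: Fraction(1, 3), 1: Fraction(2, 3)}
independent = {(x, y): px[x] * py[y] for x in px for y in py}

# Hypothetical dependent pair.
dependent = {
    (0, 0): Fraction(3, 10), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10), (1, 1): Fraction(5, 10),
}

print(are_related(independent), are_related(dependent))
```

In practice the joint distribution is not known and the comparison has to be made through statistical inference, which is the subject of the remaining sections of the chapter.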

There is an apparent asymmetry in Definition 10.1.1, because the definition consid- 
ers only the conditional distribution of Y given X and not the conditional distribution 
of X given Y. But, if there is a change in the conditional distribution of Y given X = x, 
as we change x, then by the above comment, X and Y are not independent; thus there 
must be a change in the conditional distribution of X given Y = y, as we change y 
(also see Problem 10.1.23). 

Notice that the definition is applicable no matter what kind of variables we are 
dealing with. So both could be quantitative variables, or both categorical variables, or 
one could be a quantitative variable while the other is a categorical variable. 

Definition 10.1.1 says that X and Y are related if any change is observed in the 
conditional distribution. In reality, this would mean that there is practically always a 
relationship between variables X and Y. It seems likely that we will always detect some 
difference if we carry out a census and calculate all the relevant conditional distribu- 
tions. This is where the idea of the strength of a relationship among variables becomes 
relevant, for if we see large changes in the conditional distributions, then we can say a 
strong relationship exists. If we see only very small changes, then we can say a very 
weak relationship exists that is perhaps of no practical importance. 




The Role of Statistical Models 


If a relationship exists between two variables, then its form is completely described by 
the set of conditional distributions of Y given X. Sometimes it may be necessary to 
describe the relationship using all these conditional distributions. In many problems, 
however, we look for a simpler presentation. In fact, we often assume a statistical 
model that prescribes a simple form for how the conditional distributions change as we 
change X. 

Consider the following example. 


EXAMPLE 10.1.1 Simple Normal Linear Regression Model 

In Section 10.3.2, we will discuss the simple normal linear regression model, where 
the conditional distribution of quantitative variable Y, given the quantitative variable 
X = x, is assumed to be distributed 


$$N(\beta_1 + \beta_2 x, \sigma^2),$$

where $\beta_1$, $\beta_2$, and $\sigma^2$ are unknown. For example, Y could be the blood pressure of an
individual and X the amount of salt the person consumed each day.

In this case, the conditional distributions have constant shape and change, as x
changes, only through the conditional mean. The mean moves along the line given by
$\beta_1 + \beta_2 x$ for some intercept $\beta_1$ and slope $\beta_2$. If this model is correct, then the variables
are unrelated if and only if $\beta_2 = 0$, as this is the only situation in which the conditional
distributions can remain constant as we change x. ■


Statistical models, like that described in Example 10.1.1, can be wrong. There 
is nothing requiring that two quantitative variables must be related in that way. For 
example, the conditional variance of Y can vary with x, and the very shape of the 
conditional distribution can vary with x, too. The model of Example 10.1.1 is an 
instance of a simplifying assumption that is appropriate in many practical contexts. 
However, methods such as those discussed in Chapter 9 must be employed to check 
model assumptions before accepting statistical inferences based on such a model. We 
will always consider model checking as part of our discussion of the various models 
used to examine the relationship among variables. 


Response and Predictor Variables 


Often, we think of Y as a dependent variable (depending on X) and of X as an indepen- 
dent variable (free to vary). Our goal, then, is to predict the value of Y given the value 
of X. In this situation, we call Y the response variable and X the predictor variable. 

Sometimes, though, there is really nothing to distinguish the roles of X and Y. For 
example, suppose that X is the weight of an individual in kilograms and Y is the height 
in centimeters. We could then think of predicting weight from height or conversely. It 
is then immaterial which we choose to condition on. 

In many applications, there is more than one response variable and more than one 
predictor variable X. We will not consider the situation in which we have more than 
one response variable, but we will consider the case in which $X = (X_1, \ldots, X_k)$ is
k-dimensional. Here, the various predictors that make up X could be all categorical,
all quantitative, or some mixture of categorical and quantitative variables.

The definition of a relationship existing between response variable Y and the set of
predictors $(X_1, \ldots, X_k)$ is exactly as in Definition 10.1.1. In particular, a relationship
exists between Y and $(X_1, \ldots, X_k)$ if there is any change in the conditional distribution
of Y given $(X_1, \ldots, X_k) = (x_1, \ldots, x_k)$ when $(x_1, \ldots, x_k)$ is varied. If such a relationship
exists, then the form of the relationship is specified by the full set of conditional
distributions. Again, statistical models are often used where simplifying assumptions
are made about the form of the relationship. Consider the following example.


EXAMPLE 10.1.2 The Normal Linear Model with k Predictors 

In Section 10.3.4, we will discuss the normal multiple linear regression model. For 
this, the conditional distribution of quantitative variable Y, given that the quantitative 
predictors $(X_1, \ldots, X_k) = (x_1, \ldots, x_k)$, is assumed to be the

$$N(\beta_1 + \beta_2 x_1 + \cdots + \beta_{k+1} x_k, \sigma^2)$$

distribution, where $\beta_1, \ldots, \beta_{k+1}$ and $\sigma^2$ are unknown. For example, Y could be blood
pressure, $X_1$ the amount of daily salt intake, $X_2$ the age of the individual, $X_3$ the weight
of the individual, etc.

In this case, the conditional distributions have constant shape and change, as the
values of the predictors $(x_1, \ldots, x_k)$ change, only through the conditional mean, which
changes according to the function $\beta_1 + \beta_2 x_1 + \cdots + \beta_{k+1} x_k$. Notice that, if this model
is correct, then the variables are unrelated if and only if $\beta_2 = \cdots = \beta_{k+1} = 0$, as this
is the only situation in which the conditional distributions can remain constant as we
change $(x_1, \ldots, x_k)$. ■


When we split a set of variables $Y, X_1, \ldots, X_k$ into response Y and predictors
$(X_1, \ldots, X_k)$, we are implicitly saying that we are directly interested only in the conditional
distributions of Y given $(X_1, \ldots, X_k)$. There may be relationships among the
predictors $X_1, \ldots, X_k$, however, and these can be of interest.

For example, suppose we have two predictors $X_1$ and $X_2$, and the conditional distribution
of $X_1$ given $X_2$ is virtually degenerate at a value $a + cX_2$ for some constants
a and c. Then it is not a good idea to include both $X_1$ and $X_2$ in a model, such as
that discussed in Example 10.1.2, as this can make the analysis very sensitive to small 
changes in the data. This is known as the problem of multicollinearity. The effect of 
multicollinearity, and how to avoid it, will not be discussed any further in this text. This 
is, however, a topic of considerable practical importance. 


Regression Models 


Suppose that the response Y is quantitative and we have k predictors $(X_1, \ldots, X_k)$.
One of the most important simplifying assumptions used in practice is the regression
assumption, namely, we assume that, as we change $(X_1, \ldots, X_k)$, the only thing that
can possibly change about the conditional distribution of Y given $(X_1, \ldots, X_k)$ is the
conditional mean $E(Y \mid X_1, \ldots, X_k)$. The importance of this assumption is that, to
analyze the relationship between Y and $(X_1, \ldots, X_k)$, we now need only consider how
$E(Y \mid X_1, \ldots, X_k)$ changes as $(X_1, \ldots, X_k)$ changes. Indeed, if $E(Y \mid X_1, \ldots, X_k)$
does not change as $(X_1, \ldots, X_k)$ changes, then there is no relationship between Y
and the predictors. Of course, this kind of an analysis is dependent on the regression
assumption holding, and the methods of Section 9.1 must be used to check this. Regression
models, namely, statistical models where we make the regression assumption,
are among the most important statistical models used in practice. Sections 10.3 and
10.4 discuss several instances of regression models.

Regression models are often presented in the form

$$Y = E(Y \mid X_1, \ldots, X_k) + Z, \qquad (10.1.1)$$

where $Z = Y - E(Y \mid X_1, \ldots, X_k)$ is known as the error term. We see immediately
that, if the regression assumption applies, then the conditional distribution of Z
given $(X_1, \ldots, X_k)$ is fixed as we change $(X_1, \ldots, X_k)$ and, conversely, if the conditional
distribution of Z given $(X_1, \ldots, X_k)$ is fixed as we change $(X_1, \ldots, X_k)$,
then the regression assumption holds. So when the regression assumption applies,
(10.1.1) provides a decomposition of Y into two parts: (1) a part possibly dependent on
$(X_1, \ldots, X_k)$, namely, $E(Y \mid X_1, \ldots, X_k)$, and (2) a part that is always independent
of $(X_1, \ldots, X_k)$, namely, the error Z. Note that Examples 10.1.1 and 10.1.2 can be
written in the form (10.1.1), where $Z \sim N(0, \sigma^2)$.
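The decomposition (10.1.1) can be illustrated by simulating from the simple normal linear regression model of Example 10.1.1. The sketch below is illustrative only: the values of the intercept, slope and error standard deviation, the chosen x values, and the function name are all assumptions made for the demonstration. For each x it summarizes the simulated responses and the errors Z = Y - E(Y | X = x).

```python
import random
from statistics import mean, stdev


def simulate_decomposition(beta1, beta2, sigma, xs, reps=5000, seed=1):
    """For each x, simulate Y ~ N(beta1 + beta2 * x, sigma^2) and summarize Y and Z.

    Returns {x: (mean of Y, mean of Z, standard deviation of Z)}.
    """
    rng = random.Random(seed)
    summary = {}
    for x in xs:
        cond_mean = beta1 + beta2 * x        # E(Y | X = x)
        ys = [rng.gauss(cond_mean, sigma) for _ in range(reps)]
        zs = [y - cond_mean for y in ys]     # error term Z = Y - E(Y | X = x)
        summary[x] = (mean(ys), mean(zs), stdev(zs))
    return summary


# Hypothetical parameter values: intercept 1, slope 2, error standard deviation 3.
result = simulate_decomposition(1.0, 2.0, 3.0, xs=(0, 5, 10))
for x, (y_mean, z_mean, z_sd) in result.items():
    print(x, round(y_mean, 2), round(z_mean, 2), round(z_sd, 2))
```

The mean of Y moves with x while the summaries of Z stay essentially the same for every x, which is what the regression assumption asserts about the error term.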


10.1.2 Cause-Effect Relationships and Experiments


Suppose now that we have variables X and Y defined on a population $\Pi$ and have
concluded that a relationship exists according to Definition 10.1.1. This may be based
on having conducted a full census of $\Pi$, or, more typically, we will have drawn a
simple random sample from $\Pi$ and then used the methods of the remaining sections of
this chapter to conclude that such a relationship exists. If Y is playing the role of the 
response and if X is the predictor, then we often want to be able to assert that changes 
in X are causing the observed changes in the conditional distributions of Y. Of course, 
if there are no changes in the conditional distributions, then there is no relationship 
between X and Y and hence no cause-effect relationship, either. 

For example, suppose that the amount of carbon dioxide gas being released in the 
atmosphere is increasing, and we observe that mean global temperatures are rising. If 
we have reason to believe that the amount of carbon dioxide released can have an effect 
on temperature, then perhaps it is sensible to believe that the increase in carbon dioxide 
emissions is causing the observed increase in mean global temperatures. As another 
example, for many years it has been observed that smokers suffer from respiratory 
diseases much more frequently than do nonsmokers. It seems reasonable, then, to 
conclude that smoking causes an increased risk for respiratory disease. On the other 
hand, suppose we consider the relationship between weight and height. It seems clear 
that a relationship exists, but it does not make any sense to say that changes in one of 
the variables are causing the changes in the conditional distributions of the other.




Confounding Variables 


When can we say that an observed relationship between X and Y is a cause-effect 
relationship? If a relationship exists between X and Y, then we know that there are at 
least two values $x_1$ and $x_2$ such that $f_Y(\cdot \mid X = x_1) \neq f_Y(\cdot \mid X = x_2)$, i.e., these two
conditional distributions are not equal. If we wish to say that this difference is caused
by the change in X, then we have to know categorically that there is no other variable
Z defined on $\Pi$ that confounds with X. The following example illustrates the idea of
two variables confounding. 


EXAMPLE 10.1.3 

Suppose that $\Pi$ is a population of students such that most females hold a part-time
job and most males do not. A researcher is interested in the distribution of grades, as 
measured by grade point average (GPA), and is looking to see if there is a relationship 
between GPA and gender. On the basis of the data collected, the researcher observes 
a difference in the conditional distribution of GPA given gender and concludes that a 
relationship exists between these variables. It seems clear, however, that an assertion 
of a cause-effect relationship existing between GPA and gender is not warranted, as 
the difference in the conditional distributions could also be attributed to the difference 
in part-time work status rather than gender. In this example, part-time work status and 
gender are confounded. ■


A more careful analysis might rescue the situation described in Example 10.1.3, for 
if X and Z denote the confounding variables, then we could collect data on Z as well 
and examine the conditional distributions $f_Y(\cdot \mid X = x, Z = z)$. In Example 10.1.3,
these will be the conditional distributions of GPA, given gender and part-time work 
status. If these conditional distributions change as we change x, for some fixed value 
of z, then we could assert that a cause-effect relationship exists between X and Y 
provided there are no further confounding variables. Of course, there are probably still 
more confounding variables, and we really should be conditioning on all of them. This 
brings up the point that, in any practical application, we almost certainly will never 
even know all the potential confounding variables. 
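The effect of conditioning on a confounder can be illustrated with a small simulation. The sketch below is hypothetical: the 80% and 20% part-time work rates, the 0.4-point GPA shift, and all function names are assumptions made for illustration and do not come from Example 10.1.3. GPA is generated so that it depends on work status only, yet the marginal comparison by gender still shows a gap; the gap essentially vanishes once work status is held fixed.

```python
import random
from statistics import mean

def simulate_students(n=20000, seed=0):
    """Generate (female, works, gpa) triples under an assumed mechanism."""
    rng = random.Random(seed)
    students = []
    for _ in range(n):
        female = rng.random() < 0.5
        # Assumed rates: most females hold a part-time job, most males do not.
        works = rng.random() < (0.8 if female else 0.2)
        # Assumed mechanism: GPA depends on work status only, not on gender.
        gpa = rng.gauss(3.0 - (0.4 if works else 0.0), 0.3)
        students.append((female, works, gpa))
    return students

def mean_gpa(students, female, works=None):
    """Mean GPA given gender, optionally also given work status."""
    return mean(g for f, w, g in students
                if f == female and (works is None or w == works))

students = simulate_students()
# Difference in mean GPA (female minus male), ignoring work status.
gap_marginal = mean_gpa(students, True) - mean_gpa(students, False)
# The same difference with work status held fixed.
gap_workers = mean_gpa(students, True, True) - mean_gpa(students, False, True)
gap_nonworkers = mean_gpa(students, True, False) - mean_gpa(students, False, False)
print(gap_marginal, gap_workers, gap_nonworkers)
```

Comparing means is only a summary of the conditional distributions, and the simulation has a single confounder by construction; in practice further confounding variables may remain.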


Controlling Predictor Variable Assignments 


Fortunately, there is sometimes a way around the difficulties raised by confounding 
variables. Suppose we can control the value of the variable X for any $\pi \in \Pi$, i.e.,
we can assign the value x to $\pi$ so that $X(\pi) = x$ for any of the possible values of x.
In Example 10.1.3, this would mean that we could assign a part-time work status to
any student in the population. Now consider the following idealized situation. Imagine
assigning every element $\pi \in \Pi$ the value $X(\pi) = x_1$ and then carrying out a census
to obtain the conditional distribution $f_Y(\cdot \mid X = x_1)$. Now imagine assigning every
$\pi \in \Pi$ the value $X(\pi) = x_2$ and then carrying out a census to obtain the conditional
distribution $f_Y(\cdot \mid X = x_2)$. If there is any difference in $f_Y(\cdot \mid X = x_1)$ and
$f_Y(\cdot \mid X = x_2)$, then the only possible reason is that the value of X differs. Therefore, if
$f_Y(\cdot \mid X = x_1) \neq f_Y(\cdot \mid X = x_2)$, we can assert that a cause-effect relationship exists.


518 Section 10.1: Related Variables 


A difficulty with the above argument is that typically we can never exactly determine
$f_Y(\cdot \mid X = x_1)$ and $f_Y(\cdot \mid X = x_2)$. But in fact, we may be able to sample from
them; then the methods of statistical inference become available to us to infer whether
or not there is any difference. Suppose we take a random sample $\pi_1, \ldots, \pi_{n_1+n_2}$ from
$\Pi$ and randomly assign $n_1$ of these the value $X = x_1$, with the remaining $\pi$'s assigned
the value $x_2$. We obtain the Y values $y_{11}, \ldots, y_{1n_1}$ for those $\pi$'s assigned the value $x_1$
and obtain the Y values $y_{21}, \ldots, y_{2n_2}$ for those $\pi$'s assigned the value $x_2$. Then it is
apparent that $y_{11}, \ldots, y_{1n_1}$ is a sample from $f_Y(\cdot \mid X = x_1)$ and $y_{21}, \ldots, y_{2n_2}$ is a sample
from $f_Y(\cdot \mid X = x_2)$. In fact, provided that $n_1 + n_2$ is small relative to the population
size, then we can consider these as i.i.d. samples from these conditional distributions.
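The sampling and random assignment step just described can be sketched in a few lines. This is a minimal illustration only: the population labels, the sample sizes, and the helper name `randomly_assign` are hypothetical.

```python
import random

def randomly_assign(units, n1, rng):
    """Split sampled units at random: n1 receive level x1, the rest level x2."""
    chosen = set(rng.sample(range(len(units)), n1))
    group1 = [u for i, u in enumerate(units) if i in chosen]
    group2 = [u for i, u in enumerate(units) if i not in chosen]
    return group1, group2

rng = random.Random(1)
population = list(range(100000))       # hypothetical labels for the population elements
sample = rng.sample(population, 30)    # simple random sample of n1 + n2 = 30 units
group_x1, group_x2 = randomly_assign(sample, 12, rng)
print(len(group_x1), len(group_x2))
```

Because the split is made by the random number generator rather than by the investigator or the units themselves, no characteristic of a unit can influence which value of X it receives.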

So we see that in certain circumstances, it is possible to collect data in such a way 
that we can make inferences about whether or not a cause-effect relationship exists. 
We now specify the characteristics of the relevant data collection technique. 


Conditions for Cause-Effect Relationships


First, if our inferences are to apply to a population $\Pi$, then we must have a random
sample from that population. This is just the characteristic of what we called a sampling 
study in Section 5.4, and we must do this to avoid any selection effects. So if the 
purpose of a study is to examine the relationship between the duration of migraine 
headaches and the dosage of a certain drug, the investigator must have a random sample 
from the population of migraine headache sufferers. 

Second, we must be able to assign any possible value of the predictor variable X 
to any selected $\pi$. If we cannot do this, or do not do this, then there may be hidden
confounding variables (sometimes called lurking variables) that are influencing the 
conditional distributions of Y. So in a study of the effects of the dosage of a drug 
on migraine headaches, the investigator must be able to impose the dosage on each 
participant in the study. 

Third, after deciding what values of X we will use in our study, we must randomly 
allocate these values to members of the sample. This is done to avoid the possibility of 
selection effects. So, after deciding what dosages to use in the study of the effects of 
the dosage of a drug on migraine headaches, and how many participants will receive 
each dosage, the investigator must randomly select the individuals who will receive 
each dosage. This will (hopefully) avoid selection effects, such as only the healthiest 
individuals getting the lowest dosage, etc. 

When these requirements are met, we refer to the data collection process as an 
experiment. Statistical inference based on data collected via an experiment has the ca- 
pability of inferring that cause-effect relationships exist, so this represents an important 
and powerful scientific tool. 


A Hierarchy of Studies 


Combining this discussion with Section 5.4, we see a hierarchy of data collection meth- 
ods. Observational studies reside at the bottom of the hierarchy. Inferences drawn 
from observational studies must be taken with a degree of caution, for selection effects 
could mean that the results do not apply to the population intended, and the existence
of confounding variables means that we cannot make inferences about cause-effect
relationships. For sampling studies, we know that any inferences drawn will be about
the appropriate population; but the existence of confounding variables again causes 
difficulties for any statements about the existence of cause-effect relationships, e.g., 
just taking random samples of males and females from the population $\Pi$ of Example
10.1.3 will not avoid the confounding variables. At the top of the hierarchy reside 
experiments. 

It is probably apparent that it is often impossible to conduct an experiment. In 
Example 10.1.3, we cannot assign the value of gender, so nothing can be said about the 
existence of a cause-effect relationship between GPA and gender. 

There are many notorious examples in which assertions are made about the exis- 
tence of cause-effect relationships but for which no experiment is possible. For exam- 
ple, there have been a number of studies conducted where differences have been noted 
among the IQ distributions of various racial groups. It is impossible, however, to con- 
trol the variable racial origin, so it is impossible to assert that the observed differences 
in the conditional distributions of IQ, given race, are caused by changes in race. 

Another example concerns smoking and lung cancer in humans. It has been pointed 
out that it is impossible to conduct an experiment, as we cannot assign values of the 
predictor variable (perhaps different amounts of smoking) to humans at birth and then 
observe the response, namely, whether someone contracts lung cancer or not. This 
raises an important point. We do not simply reject the results of analyses based on 
observational studies or sampling studies because the data did not arise from an ex- 
periment. Rather, we treat these as evidence — potentially flawed evidence, but still 
evidence. 

Think of eyewitness evidence in a court of law suggesting that a crime was com- 
mitted by a certain individual. Eyewitness evidence may be unreliable, but if two or 
three unconnected eyewitnesses give similar reports, then our confidence grows in the 
reliability of the evidence. Similarly, if many observational and sampling studies seem 
to indicate that smoking leads to an increased risk for contracting lung cancer, then our 
confidence grows that a cause-effect relationship does indeed exist. Furthermore, if we 
can identify potentially confounding variables, then observational or sampling studies 
can be conducted taking these into account, increasing our confidence still more. Ul- 
timately, we may not be able to definitively settle the issue via an experiment, but it is 
still possible to build overwhelming evidence that smoking and lung cancer do have a 
cause-effect relationship. 


10.1.3 Design of Experiments


Suppose we have a response Y and a predictor X (sometimes called a factor in experimental
contexts) defined on a population $\Pi$, and we want to collect data to determine
whether a cause-effect relationship exists between them. Following the discussion in 
Section 10.1.2, we will conduct an experiment. There are now a number of decisions
to be made, and our choices constitute what we call the design of the experiment. 

For example, we are going to assign values of X to the sampled elements, now
called experimental units, $\pi_1, \ldots, \pi_n$ from $\Pi$. Which of the possible values of X
should we use? When X can take only a small finite number of values, then it is
natural to use these values. On the other hand, when the number of possible values of 
X is very large or even infinite, as with quantitative predictors, then we have to choose 
values of X to use in the experiment. 

Suppose we have chosen the values $x_1, \ldots, x_k$ for X. We refer to $x_1, \ldots, x_k$ as
the levels of X; any particular assignment $x_i$ to a $\pi_j$ in the sample will be called a
treatment. Typically, we will choose the levels so that they span the possible range of
X fairly uniformly. For example, if X is temperature in degrees Celsius, and we want
to examine the relationship between Y and X for X in the range [0, 100], then, using
k = 5 levels, we might take $x_1 = 0$, $x_2 = 25$, $x_3 = 50$, $x_4 = 75$, and $x_5 = 100$.

Having chosen the levels of X, we then have to choose how many treatments of each 
level we are going to use in the experiment, i.e., decide how many response values $n_i$
we are going to observe at level $x_i$ for $i = 1, \ldots, k$.

In any experiment, we will have a finite amount of resources (money, time, etc.) at 
our disposal, which determines the sample size n from $\Pi$. The question then is how
should we choose the $n_i$ so that $n_1 + \cdots + n_k = n$? If we know nothing about the
conditional distributions $f_Y(\cdot \mid X = x_i)$, then it makes sense to use balance, namely,
choose $n_1 = \cdots = n_k$.

On the other hand, suppose we know that some of the $f_Y(\cdot \mid X = x_i)$ will exhibit
greater variability than others. For example, we might measure variability by the
variance of $f_Y(\cdot \mid X = x_i)$. Then it makes sense to allocate more treatments to the levels of
X where the response is more variable. This is because it will take more observations
to make accurate inferences about characteristics of such an $f_Y(\cdot \mid X = x_i)$ than for the
less variable conditional distributions. 

As discussed in Sections 6.3.4 and 6.3.5, we also want to choose the $n_i$ so that any
inferences we make have desired accuracy. Methods for choosing the sample sizes $n_i$,
similar to those discussed in Chapter 7, have been developed for these more complicated
designs, but we will not discuss these any further here.

Suppose, then, that we have determined $\{(x_1, n_1), \ldots, (x_k, n_k)\}$. We refer to this
set of ordered pairs as the experimental design. 
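As a rough sketch of these choices, the code below builds evenly spaced levels, a balanced design, and an unbalanced design that gives more units to more variable levels. The rule of making $n_i$ proportional to a guessed standard deviation is an assumption introduced here for illustration (the text says only that more variable levels should receive more treatments), and all function names and the guessed standard deviations are hypothetical.

```python
def evenly_spaced_levels(low, high, k):
    """k levels spanning [low, high] uniformly (k >= 2)."""
    step = (high - low) / (k - 1)
    return [low + i * step for i in range(k)]

def balanced_design(levels, n):
    """Design {(x_i, n_i)} with n_1 = ... = n_k."""
    if n % len(levels) != 0:
        raise ValueError("n must be a multiple of the number of levels")
    per_level = n // len(levels)
    return [(x, per_level) for x in levels]

def sd_proportional_design(levels, sds, n):
    """Assumed rule: n_i roughly proportional to a guessed standard deviation at x_i."""
    total = sum(sds)
    sizes = [int(n * s / total) for s in sds]
    # Hand out the units lost to rounding down, largest guessed sd first.
    order = sorted(range(len(sds)), key=lambda i: -sds[i])
    for i in order[: n - sum(sizes)]:
        sizes[i] += 1
    return list(zip(levels, sizes))

levels = evenly_spaced_levels(0, 100, 5)     # the temperature levels used in the text
design = balanced_design(levels, 50)
unequal = sd_proportional_design(levels, [1, 1, 2, 2, 4], 50)
print(design)
print(unequal)
```

Either design is just a list of (level, sample size) pairs whose sizes sum to n, matching the set of ordered pairs described above.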

Consider some examples. 


EXAMPLE 10.1.4 

Suppose that $\Pi$ is a population of students at a given university. The administration
is concerned with determining the value of each student being assigned an academic
advisor. The response variable Y will be a rating that a student assigns on a scale of 1 to
10 (completely dissatisfied to completely satisfied with their university experience) at
the end of a given semester. We treat Y as a quantitative variable. A random sample of
n = 100 students is selected from $\Pi$, and 50 of these are randomly selected to receive
advisors while the remaining 50 are not assigned advisors.

Here, the predictor X is a categorical variable that indicates whether or not the 
student has an advisor. There are only k = 2 levels, and both are used in the experiment. 
If $x_1 = 0$ denotes no advisor and $x_2 = 1$ denotes having an advisor, then $n_1 = n_2 = 50$
and we have a balanced experiment. The experimental design is given by 


{(0, 50), (1, 50)}. 




At the end of the experiment, we want to use the data to make inferences about 
the conditional distributions $f_Y(\cdot \mid X = 0)$ and $f_Y(\cdot \mid X = 1)$ to determine whether a
cause-effect relationship exists. The methods of Section 10.4 will be relevant for this. ■


EXAMPLE 10.1.5 
Suppose that $\Pi$ is a population of dairy cows. A feed company is concerned with the
relationship between weight gain, measured in kilograms, over a specific time period 
and the amount of a supplement, measured in grams/liter, of an additive put into the 
cows’ feed. Here, the response Y is the weight gain — a quantitative variable. The pre- 
dictor X is the concentration of the additive. Suppose X can plausibly range between 
0 and 2, so it is also a quantitative variable. 

The experimenter decides to use k = 4 levels with $x_1 = 0.00$, $x_2 = 0.66$, $x_3 = 1.32$,
and $x_4 = 2.00$. Further, the sample sizes $n_1 = n_2 = n_3 = n_4 = 10$ were
determined to be appropriate. So the balanced experimental design is given by


{(0.00, 10) , (0.66, 10) , (1.32, 10) , (2.00, 10)}. 


At the end of the experiment, we want to make inferences about the conditional distributions
$f_Y(\cdot \mid X = 0.00)$, $f_Y(\cdot \mid X = 0.66)$, $f_Y(\cdot \mid X = 1.32)$, and $f_Y(\cdot \mid X = 2.00)$.
The methods of Section 10.3 are relevant for this. ■


Control Treatment, the Placebo Effect, and Blinding 


Notice that in Example 10.1.5, we included the level X = 0, which corresponds to 
no application of the additive. This is called a control treatment, as it gives a baseline 
against which we can assess the effect of the predictor. In many experiments, it is 
important to include a control treatment. 

In medical experiments, there is often a placebo effect — that is, a disease sufferer 
given any treatment will often record an improvement in symptoms. The placebo effect 
is believed to be due to the fact that a sufferer will start to feel better simply because 
someone is paying attention to the condition. Accordingly, in any experiment to de- 
termine the efficacy of a drug in alleviating disease symptoms, it is important that a 
control treatment be used as well. For example, if we want to investigate whether or 
not a given drug alleviates migraine headaches, then among the dosages we select for 
the experiment, we should make sure that we include a pill containing none of the drug 
(the so-called sugar pill); that way we can assess the extent of the placebo effect. Of 
course, the recipients should not know whether they are receiving the sugar pill or the 
drug. This is called a blind experiment. If we also conceal the identity of the treatment 
from the experimenters, so as to avoid any biasing of the results on their part, then this 
is known as a double-blind experiment. 

In Example 10.1.5, we assumed that it is possible to take a sample from the popula- 
tion of all dairy cows. Strictly speaking, this is necessary if we want to avoid selection 
effects and make sure that our inferences apply to the population of interest. In prac- 
tice, however, taking a sample of experimental units from the full population of interest 
is often not feasible. For example, many medical experiments are conducted on animals,
and these are definitely not random samples from the population of the particular
animal in question, e.g., rats. 

In such cases, however, we simply recognize the possibility that selection effects or 
lurking variables could render invalid the conclusions drawn from such analyses when 
they are to be applied to the population of interest. But we still regard the results as 
evidence concerning the phenomenon under study. It is the job of the experimenter to 
come as close as possible to the idealized situation specified by a valid experiment; for 
example, randomization is still employed when assigning treatments to experimental 
units so that selection effects are avoided as much as possible. 


Interactions 


In the experiments we have discussed so far, there has been one predictor. In many 
practical contexts, there is more than one predictor. Suppose, then, that there are two 
predictors X and W and that we have decided on the levels $x_1, \ldots, x_k$ for X and the
levels $w_1, \ldots, w_l$ for W. One possibility is to look at the conditional distributions
$f_Y(\cdot \mid X = x_i)$ for $i = 1, \ldots, k$ and $f_Y(\cdot \mid W = w_j)$ for $j = 1, \ldots, l$ to determine
whether X and W individually have a relationship with the response Y. Such an approach,
however, ignores the effect of the two predictors together. In particular, the
way the conditional distributions $f_Y(\cdot \mid X = x, W = w)$ change as we change x may
depend on w; when this is the case, we say that there is an interaction between the
predictors. 
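A simplified numerical illustration of this idea follows. It assumes that the conditional distributions differ only through their means, so that it is enough to ask whether the change in the mean of Y as x moves from 0 to 1 is the same for every w; the definition above concerns the whole conditional distribution. The tables of means and the function names are hypothetical.

```python
def x_effects_by_w(means):
    """means[(x, w)] is the conditional mean of Y given X = x, W = w.
    Returns, for each w, the change in the mean as x moves from 0 to 1."""
    ws = sorted({w for (_, w) in means})
    return {w: means[(1, w)] - means[(0, w)] for w in ws}

def means_show_interaction(means, tol=1e-12):
    """True when the effect of x on the mean differs across values of w."""
    effects = list(x_effects_by_w(means).values())
    return max(effects) - min(effects) > tol

# Hypothetical tables of conditional means.
additive = {(0, 0): 10.0, (1, 0): 12.0, (0, 1): 15.0, (1, 1): 17.0}
interacting = {(0, 0): 10.0, (1, 0): 12.0, (0, 1): 15.0, (1, 1): 21.0}

print(x_effects_by_w(additive), means_show_interaction(additive))
print(x_effects_by_w(interacting), means_show_interaction(interacting))
```

In the first table, changing x adds 2 to the mean whatever the value of w; in the second, the change is 2 when w = 0 and 6 when w = 1, so the way Y responds to x depends on w.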

To investigate the possibility of an interaction existing between X and W, we must 
sample from each of the $kl$ distributions $f_Y(\cdot \mid X = x_i, W = w_j)$ for $i = 1, \ldots, k$ and
$j = 1, \ldots, l$. The experimental design then takes the form


$$\{(x_1, w_1, n_{11}), (x_2, w_1, n_{21}), \ldots, (x_k, w_l, n_{kl})\},$$


where $n_{ij}$ gives the number of applications of the treatment $(x_i, w_j)$. We say that the
two predictors X and W are completely crossed in such a design because each value 
of X used in the experiment occurs with each value of W used in the experiment. 
Of course, we can extend this discussion to the case where there are more than two 
predictors. We will discuss in Section 10.4.3 how to analyze data to determine whether 
there are any interactions between predictors. 


EXAMPLE 10.1.6 

Suppose we have a population $\Pi$ of students at a particular university and are investigating
the relationship between the response Y, given by a student's grade in calculus,
and the predictors W and X. The predictor W is the number of hours of academic 
advising given monthly to a student; it can take the values 0, 1, or 2. The predictor X 
indicates class size, where X = 0 indicates small class size and X = 1 indicates large 
class size. So we have a quantitative response Y, a quantitative predictor W taking 
three values, and a categorical predictor X taking two values. The crossed values of 
the predictors (W, X) are given by the set 


{(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)},


so there are six treatments. To conduct the experiment, the university then takes a 
random sample of 6n students and randomly assigns n students to each treatment. ■
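The completely crossed treatments of Example 10.1.6 and the random allocation of students to them can be sketched as follows. The student labels, the choice n = 4, and the helper name `assign_treatments` are hypothetical and serve only to make the example runnable.

```python
import itertools
import random

w_levels = [0, 1, 2]   # hours of academic advising per month
x_levels = [0, 1]      # 0 = small class size, 1 = large class size

# Completely crossed: every level of W is paired with every level of X.
treatments = [(w, x) for x, w in itertools.product(x_levels, w_levels)]

def assign_treatments(students, treatments, n, rng):
    """Randomly allocate n students to each treatment."""
    shuffled = students[:]
    rng.shuffle(shuffled)
    return {t: shuffled[i * n:(i + 1) * n] for i, t in enumerate(treatments)}

rng = random.Random(2)
students = [f"student_{i}" for i in range(6 * 4)]   # hypothetical sample of 6n students, n = 4
allocation = assign_treatments(students, treatments, 4, rng)
print(treatments)
```

Shuffling the whole sample once and then cutting it into consecutive groups of n gives every student the same chance of landing in any treatment.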




Sometimes we include additional predictors in an experimental design even when 
we are not primarily interested in their effects on the response Y. We do this because we 
know that such a variable has a relationship with Y. Including such predictors allows 
us to condition on their values and so investigate more precisely the relationship Y has 
with the remaining predictors. We refer to such a variable as a blocking variable. 


EXAMPLE 10.1.7 

Suppose the response variable Y is yield of wheat in bushels per acre, and the predictor 
variable X is an indicator variable for which of three types of wheat is being planted in 
an agricultural study. Each type of wheat is going to be planted on a plot of land, where 
all the plots are of the same size, but it is known that the plots used in the experiment 
will vary considerably with respect to their fertility. Note that such an experiment 
is another example of a situation in which it is impossible to randomly sample the 
experimental units (the plots) from the full population of experimental units. 

Suppose the experimenter can group the available experimental units into plots of 
low fertility and high fertility. We call these two classes of fields blocks. Let W indicate 
the type of plot. So W is a categorical variable taking two values. It then seems clear 
that the conditional distributions $f_Y(\cdot \mid X = x, W = w)$ will be much less variable than
the conditional distributions $f_Y(\cdot \mid X = x)$.

In this case, W is serving as a blocking variable. The experimental units in a par- 
ticular block, the one of low fertility or the one of high fertility, are more homogeneous 
than the full set of plots, so variability will be reduced and inferences will be more 
accurate. ■
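The variance reduction from blocking can be seen in a small simulation. The setup is an assumption made for illustration: a single wheat type (so x is fixed), hypothetical mean yields of 30 and 50 bushels per acre on low- and high-fertility plots, and a within-block standard deviation of 3.

```python
import random
from statistics import pvariance

rng = random.Random(3)
# Assumed setup: one wheat type (fixed x); plot fertility shifts the mean yield.
block_mean = {"low": 30.0, "high": 50.0}   # hypothetical mean yields, bushels per acre
yields = {w: [rng.gauss(m, 3.0) for _ in range(500)]
          for w, m in block_mean.items()}

# Variability of Y given X = x only (the two blocks pooled together).
var_ignoring_blocks = pvariance(yields["low"] + yields["high"])
# Variability of Y given X = x and W = w (within each block).
var_within_low = pvariance(yields["low"])
var_within_high = pvariance(yields["high"])
print(var_ignoring_blocks, var_within_low, var_within_high)
```

The pooled variance includes the spread between the block means as well as the spread within blocks, which is why conditioning on the block gives the less variable distributions.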


Summary of Section 10.1 


• We say two variables are related if the conditional distribution of one given the
other changes at all, as we change the value of the conditioning variable.

• To conclude that a relationship between two variables is a cause-effect relationship,
we must make sure that (through conditioning) we have taken account of
all confounding variables.

• Statistics provides a practical way of avoiding the effects of confounding variables
via conducting an experiment. For this, we must be able to assign the values
of the predictor variable to experimental units sampled from the population
of interest.

• The design of experiments is concerned with determining methods of collecting
the data so that the analysis of the data will lead to accurate inferences concerning
questions of interest.


EXERCISES 


10.1.1 Prove that discrete random variables X and Y are unrelated if and only if X and 
Y are independent. 

10.1.2 Suppose that two variables X and Y defined on a finite population $\Pi$ are functionally
related as $Y = g(X)$ for some unknown nonconstant function g. Explain how
this situation is covered by Definition 10.1.1, i.e., the definition will lead us to conclude
that X and Y are related. What about the situation in which g(x) = c for some value c 
for every x? (Hint: Use the relative frequency functions of the variables.) 

10.1.3 Suppose that a census is conducted on a population and the joint distribution of 
(X, Y) is obtained as in the following table. 


          Y = 1    Y = 2    Y = 3
X = 1     0.15     0.18     0.40
X = 2     0.12     0.09     0.06


Determine whether or not a relationship exists between Y and X.


10.1.4 Suppose that a census is conducted on a population and the joint distribution of 
(X, Y) is obtained as in the following table. 


          Y = 1    Y = 2    Y = 3
X = 1     1/6      1/6      1/3
X = 2     1/12     1/12     1/6


Determine whether or not a relationship exists between Y and X. 


10.1.5 Suppose that X is a random variable and $Y = X^2$. Determine whether or not X
and Y are related. What happens when X has a degenerate distribution? 

10.1.6 Suppose a researcher wants to investigate the relationship between birth weight 
and performance on a standardized test administered to children at two years of age. If 
a relationship is found, can this be claimed to be a cause-effect relationship? Explain 
why or why not? 

10.1.7 Suppose a large study of all doctors in Canada was undertaken to determine 
the relationship between various lifestyle choices and lifelength. If the conditional 
distribution of lifelength given various smoking habits changes, then discuss what can 
be concluded from this study. 

10.1.8 Suppose a teacher wanted to determine whether an open- or closed-book exam 
was a more appropriate way to test students on a particular topic. The response variable 
is the grade obtained on the exam out of 100. Discuss how the teacher could go about 
answering this question. 

10.1.9 Suppose a researcher wanted to determine whether or not there is a cause- 
effect relationship between the type of political ad (negative or positive) seen by a 
voter from a particular population and the way the voter votes. Discuss your advice to 
the researcher about how best to conduct the study. 

10.1.10 If two random variables have a nonzero correlation, are they necessarily re- 
lated? Explain why or why not. 

10.1.11 An experimenter wants to determine the relationship between weight change 
Y over a specified period and the use of a specially designed diet. The predictor variable 
X is a categorical variable indicating whether or not a person is on the diet. A total of 
200 volunteers signed on for the study; a random selection of 100 of these were given 
the diet and the remaining 100 continued their usual diet. 

(a) Record the experimental design. 




(b) If the results of the study are to be applied to the population of all humans, what 
concerns do you have about how the study was conducted? 

(c) It is felt that the amount of weight lost or gained also is dependent on the initial 
weight W of a participant. How would you propose that the experiment be altered to 
take this into account? 

10.1.12 A study will be conducted, involving the population of people aged 15 to 19 in 
a particular country, to determine whether a relationship exists between the response Y 
(amount spent in dollars in a week on music downloads) and the predictors W (gender) 
and X (age in years). 

(a) If observations are to be taken from every possible conditional distribution of Y 
given the two factors, then how many such conditional distributions are there? 

(b) Identify the types of each variable involved in the study. 

(c) Suppose there are enough funds available to monitor 2000 members of the popula- 
tion. How would you recommend that these resources be allocated among the various 
combinations of factors? 

(d) If a relationship is found between the response and the predictors, can this be 
claimed to be a cause-effect relationship? Explain why or why not. 

(e) Suppose that in addition, it was believed that family income would likely have an 
effect on Y and that families could be classified into low and high income. Indicate 
how you would modify the study to take this into account. 

10.1.13 A random sample of 100 households, from the set of all households contain- 
ing two or more members in a given geographical area, is selected and their television 
viewing habits are monitored for six months. A random selection of 50 of the house- 
holds is sent a brochure each week advertising a certain program. The purpose of 
the study is to determine whether there is any relationship between exposure to the 
brochure and whether or not this program is watched. 

(a) Identify suitable response and predictor variables. 

(b) If a relationship is found, can this be claimed to be a cause-effect relationship? 
Explain why or why not. 

10.1.14 Suppose we have a quantitative response variable Y and two categorical predictor
variables W and X, both taking values in {0, 1}. Suppose the conditional distributions
of Y are given by


Y | W = 0, X = 0 ~ N(3, 5)
Y | W = 1, X = 0 ~ N(3, 5)
Y | W = 0, X = 1 ~ N(4, 5)
Y | W = 1, X = 1 ~ N(4, 5).


Does W have a relationship with Y? Does X have a relationship with Y? Explain your 
answers. 


10.1.15 Suppose we have a quantitative response variable Y and two categorical predictor
variables W and X, both taking values in {0, 1}. Suppose the conditional distributions
of Y are given by


Y | W = 0, X = 0 ~ N(,5)
Y | W = 1, X = 0 ~ N(3, 5)
Y | W = 0, X = 1 ~ N(4, 5)
Y | W = 1, X = 1 ~ N(4, 5).


Does W have a relationship with Y? Does X have a relationship with Y? Explain your 
answers. 

10.1.16 Do the predictors interact in Exercise 10.1.14? Do the predictors interact in 
Exercise 10.1.15? Explain your answers. 

10.1.17 Suppose we have variables X and Y defined on the population $\Pi = \{1, 2, \ldots, 10\}$,
where X(i) = 1 when i is odd and X(i) = 0 when i is even, Y(i) = 1 when i is
divisible by 3 and Y(i) = 0 otherwise.

(a) Determine the relative frequency function of X. 

(b) Determine the relative frequency function of Y. 

(c) Determine the joint relative frequency function of (X, Y). 

(d) Determine all the conditional distributions of Y given X. 

(e) Are X and Y related? Justify your answer. 

10.1.18 A mathematical approach to examining the relationship between variables X 
and Y is to see whether there is a function g such that Y = g(X). Explain why this 
approach does not work for many practical applications where we are examining the 
relationship between variables. Explain how statistics treats this problem. 

10.1.19 Suppose a variable X takes the values 1 and 2 on a population and the conditional
distributions of Y given X are N(0, 5) when X = 1, and N(0, 7) when X = 2.
Determine whether X and Y are related and if so, describe their relationship. 

10.1.20 A variable Y has conditional distribution given X specified by N (1 + 2x, |x|) 
when X = x. Determine if X and Y are related and if so, describe what their relation- 
ship is. 

10.1.21 Suppose that $X \sim \text{Uniform}[-1, 1]$ and $Y = X^2$. Determine the correlation
between Y and X. Are X and Y related? 


PROBLEMS 


10.1.22 If there is more than one predictor involved in an experiment, do you think 
it is preferable for the predictors to interact or not? Explain your answer. Can the 
experimenter control whether or not predictors interact? 

10.1.23 Prove directly, using Definition 10.1.1, that when X and Y are related variables 
defined on a finite population $\Pi$, then Y and X are also related.

10.1.24 Suppose that X, Y, Z are independent N (0, 1) random variables and that U = 
X+Z,V = Y + Z. Determine whether or not the variables U and V are related. (Hint: 
Calculate Cov(U, V) .) 

10.1.25 Suppose that (X, Y, Z) ~ Multinomial(n, 1/3, 1/3,1/3). Are X and Y re- 
lated? 




10.1.26 Suppose that $(X, Y) \sim \text{Bivariate-Normal}(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$. Show that X and
Y are unrelated if and only if Corr(X, Y) = 0. 

10.1.27 Suppose that (X, Y, Z) have probability function $p_{X,Y,Z}$. If Y is related to X
but not to Z, then prove that $p_{X,Y,Z}(x, y, z) = p_{Y|X}(y \mid x)\, p_{X|Z}(x \mid z)\, p_Z(z)$.


10.2 Categorical Response and Predictors


There are two possible situations when we have a single categorical response Y and a 
single categorical predictor X. The categorical predictor is either random or determin- 
istic, depending on how we sample. We examine these two situations separately. 


10.2.1 Random Predictor 


We consider the situation in which X is categorical, taking values in {1,..., a}, and 
Y is categorical, taking values in {1,..., b}. If we take a sample $\pi_1, \ldots, \pi_n$ from the
population, then the values $X(\pi_i) = x_i$ are random, as are the values $Y(\pi_i) = y_i$.

Suppose the sample size n is very small relative to the population size (so we can 
assume that i.i.d. sampling is applicable). Then, letting $\theta_{ij} = P(X = i, Y = j)$, we
obtain the likelihood function (see Problem 10.2.15)


$$L(\theta_{11}, \ldots, \theta_{ab} \mid (x_1, y_1), \ldots, (x_n, y_n)) = \prod_{i=1}^{a} \prod_{j=1}^{b} \theta_{ij}^{f_{ij}}, \tag{10.2.1}$$


where $f_{ij}$ is the number of sample values with (X, Y) = (i, j). An easy computation
(see Problem 10.2.16) shows that the MLE of $(\theta_{11}, \ldots, \theta_{ab})$ is given by $\hat{\theta}_{ij} = f_{ij}/n$
and that the standard error of this estimate (because the incidence of a sample member
falling in the (i, j)-th cell is distributed Bernoulli($\theta_{ij}$) and using Example 6.3.2) is
given by


$$\sqrt{\frac{\hat{\theta}_{ij}(1 - \hat{\theta}_{ij})}{n}}.$$

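We are interested in whether or not there is a relationship between X and Y. To 
answer this, we look at the conditional distributions of Y given X. The conditional 
distributions of Y given X, using $\theta_{i\cdot} = \theta_{i1} + \cdots + \theta_{ib} = P(X = i)$, are given in the 
following table. 

$$\begin{array}{c|ccc}
 & Y = 1 & \cdots & Y = b \\ \hline
X = 1 & \theta_{11}/\theta_{1\cdot} & \cdots & \theta_{1b}/\theta_{1\cdot} \\
\vdots & \vdots & & \vdots \\
X = a & \theta_{a1}/\theta_{a\cdot} & \cdots & \theta_{ab}/\theta_{a\cdot}
\end{array}$$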


528 Section 10.2: Categorical Response and Predictors 


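Then estimating $\theta_{ij}/\theta_{i\cdot}$ by $\hat\theta_{ij}/\hat\theta_{i\cdot} = f_{ij}/f_{i\cdot}$, where $f_{i\cdot} = f_{i1} + \cdots + f_{ib}$, the estimated 
conditional distributions are as in the following table. 

$$\begin{array}{c|ccc}
 & Y = 1 & \cdots & Y = b \\ \hline
X = 1 & f_{11}/f_{1\cdot} & \cdots & f_{1b}/f_{1\cdot} \\
\vdots & \vdots & & \vdots \\
X = a & f_{a1}/f_{a\cdot} & \cdots & f_{ab}/f_{a\cdot}
\end{array}$$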


If we conclude that there is a relationship between X and Y, then we look at the table of 
estimated conditional distributions to determine the form of the relationship, i.e., how 
the conditional distributions change as we change the value of X we are conditioning 
on. 

How, then, do we infer whether or not a relationship exists between X and Y? 
No relationship exists between Y and X if and only if the conditional distributions of 
Y given X = x do not change with x. This is the case if and only if X and Y are 
independent, and this is true if and only if 

$$\theta_{ij} = P(X = i, Y = j) = P(X = i)P(Y = j) = \theta_{i\cdot}\theta_{\cdot j}$$

for every i and j, where $\theta_{\cdot j} = \theta_{1j} + \cdots + \theta_{aj} = P(Y = j)$. Therefore, to assess 
whether or not there is a relationship between X and Y, it is equivalent to assess the 
null hypothesis $H_0 : \theta_{ij} = \theta_{i\cdot}\theta_{\cdot j}$ for every i and j. 

How should we assess whether or not the observed data are surprising when $H_0$ 
holds? The methods of Section 9.1.2, and in particular Theorem 9.1.2, can be applied 
here, as we have that 

$$(F_{11}, F_{12}, \ldots, F_{ab}) \sim \text{Multinomial}(n, \theta_{1\cdot}\theta_{\cdot 1}, \theta_{1\cdot}\theta_{\cdot 2}, \ldots, \theta_{a\cdot}\theta_{\cdot b})$$

when $H_0$ holds, where $F_{ij}$ is the count in the (i, j)-th cell. 
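To apply Theorem 9.1.2, we need the MLE of the parameters of the model under 
$H_0$. The likelihood, when $H_0$ holds, is 

$$L(\theta_{1\cdot}, \ldots, \theta_{a\cdot}, \theta_{\cdot 1}, \ldots, \theta_{\cdot b} \mid (x_1, y_1), \ldots, (x_n, y_n)) = \prod_{i=1}^{a} \prod_{j=1}^{b} (\theta_{i\cdot}\theta_{\cdot j})^{f_{ij}}. \tag{10.2.2}$$

From this, we deduce (see Problem 10.2.17) that the MLE's of the $\theta_{i\cdot}$ and $\theta_{\cdot j}$ are given 
by $\hat\theta_{i\cdot} = f_{i\cdot}/n$ and $\hat\theta_{\cdot j} = f_{\cdot j}/n$. Therefore, the relevant chi-squared statistic is 

$$X^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{(f_{ij} - n\hat\theta_{i\cdot}\hat\theta_{\cdot j})^2}{n\hat\theta_{i\cdot}\hat\theta_{\cdot j}}.$$

Under $H_0$, the parameter space has dimension (a - 1) + (b - 1) = a + b - 2, so we 
compare the observed value of $X^2$ with the $\chi^2((a - 1)(b - 1))$ distribution because 
ab - 1 - a - b + 2 = (a - 1)(b - 1). 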

Consider an example. 




EXAMPLE 10.2.1 Piston Ring Data 
The following table gives the counts of piston ring failures, where variable Y is the 
compressor number and variable X is the leg position based on a sample of n = 166. 
These data were taken from Statistical Methods in Research and Production, by O. L. 
Davies (Hafner Publishers, New York, 1961). 

Here, Y takes four values and X takes three values (N = North, C = Central, and S 
= South). 

         Y = 1   Y = 2   Y = 3   Y = 4
X = N      17      11      11      14
X = C      17       9       8       7
X = S      12      13      19      28


The question of interest is whether or not there is any relation between compressor and 
leg position. Because $f_{1\cdot} = 53$, $f_{2\cdot} = 41$, and $f_{3\cdot} = 72$, the conditional distributions 
of Y given X are estimated as in the rows of the following table. 

         Y = 1             Y = 2             Y = 3             Y = 4
X = N    17/53 = 0.321     11/53 = 0.208     11/53 = 0.208     14/53 = 0.264
X = C    17/41 = 0.415      9/41 = 0.220      8/41 = 0.195      7/41 = 0.171
X = S    12/72 = 0.167     13/72 = 0.181     19/72 = 0.264     28/72 = 0.389


Comparing the rows, it certainly looks as if there is a difference in the conditional 
distributions, but we must assess whether or not the observed differences can be ex- 
plained as due to sampling error. To see if the observed differences are real, we carry 
out the chi-squared test. 

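Under the null hypothesis of independence, the MLE's are given by 

$$\hat\theta_{\cdot 1} = \frac{46}{166}, \quad \hat\theta_{\cdot 2} = \frac{33}{166}, \quad \hat\theta_{\cdot 3} = \frac{38}{166}, \quad \hat\theta_{\cdot 4} = \frac{49}{166}$$

for the Y probabilities, and by 

$$\hat\theta_{1\cdot} = \frac{53}{166}, \quad \hat\theta_{2\cdot} = \frac{41}{166}, \quad \hat\theta_{3\cdot} = \frac{72}{166}$$

for the X probabilities. Then the estimated expected counts $n\hat\theta_{i\cdot}\hat\theta_{\cdot j}$ are given by the 
following table. 

         Y = 1      Y = 2      Y = 3      Y = 4
X = N    14.6867    10.5361    12.1325    15.6446
X = C    11.3614     8.1506     9.3855    12.1024
X = S    19.9518    14.3133    16.4819    21.2530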


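The standardized residuals (using (9.1.6)) 

$$\frac{f_{ij} - n\hat\theta_{i\cdot}\hat\theta_{\cdot j}}{\left(n\hat\theta_{i\cdot}\hat\theta_{\cdot j}(1 - \hat\theta_{i\cdot}\hat\theta_{\cdot j})\right)^{1/2}}$$

are as in the following table. 

         Y = 1      Y = 2      Y = 3      Y = 4
X = N     0.6322     0.1477    -0.3377    -0.4369
X = C     1.7332     0.3051    -0.4656    -1.5233
X = S    -1.8979    -0.3631     0.6536     1.5673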


All of the standardized residuals seem reasonable, and we have that $X^2 = 11.7223$ 
with $P(\chi^2(6) > 11.7223) = 0.0685$, which is not unreasonably small. 

So, while there may be some indication that the null hypothesis of no relationship 
is false, this evidence is not overwhelming. Accordingly, in this case, we may assume 
that Y and X are independent and use the estimates of cell probabilities obtained under 
this assumption. 
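
The computations in Example 10.2.1 can be checked numerically. The following is a minimal Python sketch, not the authors' code; the function names `chi2_independence` and `chi2_sf_even_df` are hypothetical, and the tail-probability helper relies on the closed form of the chi-squared survival function that holds only for even degrees of freedom.

```python
import math

# Observed counts from Example 10.2.1 (rows: X = N, C, S; columns: Y = 1..4).
counts = [
    [17, 11, 11, 14],
    [17, 9, 8, 7],
    [12, 13, 19, 28],
]

def chi2_independence(table):
    """Return (X2, df, expected counts, standardized residuals) for a two-way table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # Estimated expected counts n * theta_hat_i. * theta_hat_.j under independence.
    expected = [[r * c / n for c in col_tot] for r in row_tot]
    x2 = sum((f - e) ** 2 / e
             for frow, erow in zip(table, expected)
             for f, e in zip(frow, erow))
    # Standardized residuals (f - e) / sqrt(e * (1 - e / n)).
    resid = [[(f - e) / math.sqrt(e * (1 - e / n)) for f, e in zip(frow, erow)]
             for frow, erow in zip(table, expected)]
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return x2, df, expected, resid

def chi2_sf_even_df(x, df):
    """P(chi2(df) > x); this closed form is valid only for even df."""
    if df % 2 != 0:
        raise ValueError("this helper only handles even degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

x2, df, expected, resid = chi2_independence(counts)
p_value = chi2_sf_even_df(x2, df)
print(round(x2, 4), df, round(p_value, 4))
```

Up to rounding, this should reproduce the expected counts, the standardized residuals, and the values $X^2 = 11.7223$ and 0.0685 reported in the example.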


We must also be concerned with model checking, i.e., is the model that we have as- 
sumed for the data $(x_1, y_1), \ldots, (x_n, y_n)$ correct? If these observations are i.i.d., then 
indeed the model is correct, as that is all that is being effectively assumed. So we need 
to check that the observations are a plausible i.i.d. sample. Because the minimal suffi- 
cient statistic is given by $(f_{11}, \ldots, f_{ab})$, such a test could be based on the conditional 
distribution of the sample $(x_1, y_1), \ldots, (x_n, y_n)$ given $(f_{11}, \ldots, f_{ab})$. The distribution 
theory for such tests is computationally difficult to implement, however, and we do not 
pursue this topic further in this text. 


10.2.2 Deterministic Predictor 


Consider again the situation in which X is categorical, taking values in {1, ..., a}, and 
Y is categorical, taking values in {1, ..., b}. But now suppose that we take a sample 
$\pi_1, \ldots, \pi_n$ from the population, where we have specified that $n_i$ sample members have 
the value X = i, etc. This could be by assignment, when we are trying to determine 
whether a cause-effect relationship exists; or we might have a populations $\Pi_1, \ldots, \Pi_a$ 
and want to see whether there is any difference in the distribution of Y between popu- 
lations. Note that $n_1 + \cdots + n_a = n$. 

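In both cases, we again want to make inferences about the conditional distributions 
of Y given X as represented by the following table. 

$$\begin{array}{c|ccc}
 & Y = 1 & \cdots & Y = b \\ \hline
X = 1 & \theta_{1|X=1} & \cdots & \theta_{b|X=1} \\
\vdots & \vdots & & \vdots \\
X = a & \theta_{1|X=a} & \cdots & \theta_{b|X=a}
\end{array}$$

A difference in the conditional distributions means there is a relationship between Y 
and X. If we denote the number of observations in the ith sample that have Y = j 
by $f_{ij}$, then assuming the sample sizes are small relative to the population sizes, the 
likelihood function is given by 

$$L(\theta_{1|X=1}, \ldots, \theta_{b|X=a} \mid (x_1, y_1), \ldots, (x_n, y_n)) = \prod_{i=1}^{a} \prod_{j=1}^{b} (\theta_{j|X=i})^{f_{ij}}, \tag{10.2.3}$$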




and the MLE is given by $\hat\theta_{j|X=i} = f_{ij}/n_i$ (Problem 10.2.18). 
There is no relationship between Y and X if and only if the conditional distributions 
do not vary as we vary X, or if and only if 

$$H_0 : \theta_{j|X=1} = \cdots = \theta_{j|X=a} = \theta_j$$

for all j = 1, ..., b for some probability distribution $\theta_1, \ldots, \theta_b$. Under $H_0$, the likeli- 
hood function is given by 

$$L(\theta_1, \ldots, \theta_b \mid (x_1, y_1), \ldots, (x_n, y_n)) = \prod_{j=1}^{b} \theta_j^{f_{\cdot j}}, \tag{10.2.4}$$

and the MLE of $\theta_j$ is given by $\hat\theta_j = f_{\cdot j}/n$ (see Problem 10.2.19). Then, applying 
Theorem 9.1.2, we have that the statistic 

$$X^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{(f_{ij} - n_i\hat\theta_j)^2}{n_i\hat\theta_j}$$

has an approximate $\chi^2((a - 1)(b - 1))$ distribution under $H_0$ because there are a(b - 1) 
free parameters in the full model, (b - 1) parameters in the independence model, and 
a(b - 1) - (b - 1) = (a - 1)(b - 1). 


Consider an example. 


EXAMPLE 10.2.2 

This example is taken from a famous applied statistics book, Statistical Methods, 6th 
ed., by G. Snedecor and W. Cochran (Iowa State University Press, Ames, 1967). In- 
dividuals were classified according to their blood type Y (O, A, B, and AB, although 
the AB individuals were eliminated, as they were small in number) and also classified 
according to X, their disease status (peptic ulcer = P, gastric cancer = G, or control = 
C). So we have three populations; namely, those suffering from a peptic ulcer, those 
suffering from gastric cancer, and those suffering from neither. We suppose further 
that the individuals involved in the study can be considered as random samples from 
the respective populations. 

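The data are given in the following table. 

         Y = O   Y = A   Y = B   Total
X = P      983     679     134    1796
X = G      383     416      84     883
X = C     2892    2625     570    6087

The estimated conditional distributions of Y given X are then as follows. 

         Y = O                 Y = A                 Y = B
X = P     983/1796 = 0.547      679/1796 = 0.378     134/1796 = 0.075
X = G      383/883 = 0.434       416/883 = 0.471       84/883 = 0.095
X = C    2892/6087 = 0.475     2625/6087 = 0.431     570/6087 = 0.094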




We now want to assess whether or not there is any evidence for concluding that a 
difference exists among these conditional distributions. Under the null hypothesis that 
no difference exists, the MLE's of the probabilities $\theta_1 = P(Y = \text{O})$, $\theta_2 = P(Y = \text{A})$, 
and $\theta_3 = P(Y = \text{B})$ are given by 

$$\hat\theta_1 = \frac{983 + 383 + 2892}{1796 + 883 + 6087} = 0.4857,$$

$$\hat\theta_2 = \frac{679 + 416 + 2625}{1796 + 883 + 6087} = 0.4244,$$

$$\hat\theta_3 = \frac{134 + 84 + 570}{1796 + 883 + 6087} = 0.0899.$$


Then the estimated expected counts $n_i\hat\theta_j$ are given by the following table. 

         Y = O        Y = A        Y = B
X = P     872.3172     762.2224    161.4604
X = G     428.8731     374.7452     79.3817
X = C    2956.4559    2583.3228    547.2213


The standardized residuals (using (9.1.6)) $(f_{ij} - n_i\hat\theta_j)/(n_i\hat\theta_j(1 - \hat\theta_j))^{1/2}$ are given by 
the following table. 

         Y = O      Y = A      Y = B
X = P     5.2219    -3.9705    -2.2643
X = G    -3.0910     2.8111     0.5441
X = C    -1.6592     1.0861     1.0227


We have that $X^2 = 40.5434$ and $P(\chi^2(4) > 40.5434) = 0.0000$, so we have strong 
evidence against the null hypothesis of no relationship existing between Y and X. Ob- 
serve the large residuals when X = P and Y = O, Y = A. 

We are left with examining the conditional distributions to ascertain what form 
the relationship between Y and X takes. A useful tool in this regard is to plot the 
conditional distributions in bar charts, as we have done in Figure 10.2.1. From this, we 
see that the peptic ulcer population has a greater proportion of blood type O than the 
other populations. 


[Bar charts of the estimated conditional distributions of Y (Y = O, Y = A, Y = B), one panel for each of X = P, X = G, and X = C.] 

Figure 10.2.1: Plot of the conditional distributions of Y, given X, in Example 10.2.2. 
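
As a check on Example 10.2.2, the following Python sketch recomputes the estimated conditional distributions, the pooled MLE's under the null hypothesis, and the chi-squared statistic for the deterministic-predictor case. It is only an illustration under stated assumptions: the variable names are hypothetical, and the tail probability uses the closed form $e^{-x/2}(1 + x/2)$, which applies specifically to 4 degrees of freedom.

```python
import math

# Observed counts from Example 10.2.2 (rows: X = P, G, C; columns: Y = O, A, B).
counts = {
    "P": [983, 679, 134],
    "G": [383, 416, 84],
    "C": [2892, 2625, 570],
}

n_i = {x: sum(row) for x, row in counts.items()}   # fixed sample sizes n_i
n = sum(n_i.values())
col_tot = [sum(col) for col in zip(*counts.values())]
theta_hat = [c / n for c in col_tot]               # pooled MLE of theta_j under H0

# Estimated conditional distributions f_ij / n_i (MLE's in the full model).
cond = {x: [f / n_i[x] for f in row] for x, row in counts.items()}

# Chi-squared statistic comparing f_ij with n_i * theta_hat_j.
x2 = sum((f - n_i[x] * t) ** 2 / (n_i[x] * t)
         for x, row in counts.items()
         for f, t in zip(row, theta_hat))
df = (len(counts) - 1) * (len(col_tot) - 1)

# Tail probability of chi2(4): exp(-x/2) * (1 + x/2).
p_value = math.exp(-x2 / 2) * (1 + x2 / 2)
print(round(x2, 2), df, p_value)
```

Because the sketch carries the MLE's at full precision rather than rounded to four decimals, the statistic may differ from the tabulated value in the last decimal place.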




10.2.3 Bayesian Formulation 


We now add a prior density $\pi$ for the unknown values of the parameters of the models 
discussed in Sections 10.2.1 and 10.2.2. Depending on how we choose $\pi$, and de- 
pending on the particular computation we want to carry out, we could be faced with 
some difficult computational problems. Of course, we have the Monte Carlo methods 
available in such circumstances, which can often render a computation fairly straight- 
forward. 

The most common choice of prior in these circumstances is to choose a conjugate 
prior. Because the likelihoods discussed in this section are as in Example 7.1.3, we see 
immediately that Dirichlet priors will be conjugate for the full model in Section 10.2.1 
and that products of independent Dirichlet priors will be conjugate for the full model 
in Section 10.2.2. 

In Section 10.2.1, the general likelihood (i.e., with no restrictions on the $\theta_{ij}$) is of 
the form 

$$L(\theta_{11}, \ldots, \theta_{ab} \mid (x_1, y_1), \ldots, (x_n, y_n)) = \prod_{i=1}^{a} \prod_{j=1}^{b} \theta_{ij}^{f_{ij}}.$$

If we place a Dirichlet($\alpha_{11}, \ldots, \alpha_{ab}$) prior on the parameter, then the posterior density 
is proportional to 

$$\prod_{i=1}^{a} \prod_{j=1}^{b} \theta_{ij}^{f_{ij} + \alpha_{ij} - 1},$$

so the posterior is a Dirichlet($f_{11} + \alpha_{11}, \ldots, f_{ab} + \alpha_{ab}$) distribution. 


In Section 10.2.2, the general likelihood is of the form 

$$L(\theta_{1|X=1}, \ldots, \theta_{b|X=a} \mid (x_1, y_1), \ldots, (x_n, y_n)) = \prod_{i=1}^{a} \prod_{j=1}^{b} (\theta_{j|X=i})^{f_{ij}}.$$

Because $\sum_{j=1}^{b} \theta_{j|X=i} = 1$ for each i = 1, ..., a, we must place a prior on each 
distribution $(\theta_{1|X=i}, \ldots, \theta_{b|X=i})$. If we choose the prior on the ith distribution to be 
Dirichlet($\alpha_{1|i}, \ldots, \alpha_{b|i}$), then the posterior density is proportional to 

$$\prod_{i=1}^{a} \prod_{j=1}^{b} (\theta_{j|X=i})^{f_{ij} + \alpha_{j|i} - 1}.$$

We recognize this as the product of independent Dirichlet distributions, with the poste- 
rior distribution on $(\theta_{1|X=i}, \ldots, \theta_{b|X=i})$ equal to a 

Dirichlet($f_{i1} + \alpha_{1|i}, \ldots, f_{ib} + \alpha_{b|i}$) 

distribution. 
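
The conjugate updates above amount to adding the observed counts to the prior parameters. The following Python sketch illustrates this for the full model of Section 10.2.1; the 2 x 2 table of counts is hypothetical (it is not taken from the text), as are the function names. The posterior mean uses the standard fact that a Dirichlet component has mean equal to its parameter divided by the sum of all parameters.

```python
# Hypothetical 2 x 2 table of counts f_ij (illustrative only, not from the text).
counts = [[12, 8],
          [5, 15]]
# Dirichlet prior parameters alpha_ij; all ones gives the uniform prior.
prior = [[1.0, 1.0],
         [1.0, 1.0]]

def dirichlet_posterior(counts, prior):
    """Posterior Dirichlet parameters f_ij + alpha_ij, cell by cell."""
    return [[f + a for f, a in zip(frow, arow)]
            for frow, arow in zip(counts, prior)]

def dirichlet_mean(params):
    """Mean of a Dirichlet: each parameter divided by the sum of all parameters."""
    total = sum(sum(row) for row in params)
    return [[p / total for p in row] for row in params]

post = dirichlet_posterior(counts, prior)
post_mean = dirichlet_mean(post)
print(post)
print(post_mean)
```

For the model of Section 10.2.2, the same update would be applied separately to each row, with each row normalized by its own parameter total, since the posterior is a product of independent Dirichlet distributions.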
A special and important case of the Dirichlet priors corresponds to the situation in 
which we feel that we have no information about the parameter. In such a situation, it 
makes sense to choose all the parameters of the Dirichlet to be 1, so that the priors are 
all uniform. 

There are many characteristics of a Dirichlet distribution that can be evaluated in 
closed form, e.g., the expectation of any polynomial (see Problem 10.2.20). But still 
there will be many quantities for which exact computations will not be available. It 
turns out that we can always easily generate samples from Dirichlet distributions, pro- 
vided we have access to a generator for beta distributions. This is available with most 
statistical packages. We now discuss how to do this. 


EXAMPLE 10.2.3 Generating from a Dirichlet($\alpha_1, \ldots, \alpha_k$) Distribution 
The technique we discuss here is a commonly used method for generating from multi- 
variate distributions. If we want to generate a value of the random vector $(X_1, \ldots, X_k)$, 
then we can proceed as follows. First, generate a value $x_1$ from the marginal distrib- 
ution of $X_1$. Next, generate a value $x_2$ from the conditional distribution of $X_2$ given 
$X_1 = x_1$. Then generate a value $x_3$ from the conditional distribution of $X_3$, given that 
$X_1 = x_1$ and $X_2 = x_2$, etc. 

If the distribution of X is discrete, then we have that the probability of a particular 
vector of values $(x_1, x_2, \ldots, x_k)$ arising via this scheme is 

$$P(X_1 = x_1)\,P(X_2 = x_2 \mid X_1 = x_1) \cdots P(X_k = x_k \mid X_1 = x_1, \ldots, X_{k-1} = x_{k-1}).$$

Expanding each of these conditional probabilities, we obtain 

$$P(X_1 = x_1)\,\frac{P(X_1 = x_1, X_2 = x_2)}{P(X_1 = x_1)} \cdots \frac{P(X_1 = x_1, \ldots, X_{k-1} = x_{k-1}, X_k = x_k)}{P(X_1 = x_1, \ldots, X_{k-1} = x_{k-1})},$$

which equals $P(X_1 = x_1, \ldots, X_{k-1} = x_{k-1}, X_k = x_k)$, and so $(x_1, x_2, \ldots, x_k)$ is 
a value from the joint distribution of $(X_1, \ldots, X_k)$. This approach also works for ab- 
solutely continuous distributions, and the proof is the same but uses density functions 
instead. 

In the case of $(X_1, \ldots, X_{k-1}) \sim$ Dirichlet($\alpha_1, \ldots, \alpha_k$), we have that (see Chal- 
lenge 10.2.23) $X_1 \sim$ Beta($\alpha_1, \alpha_2 + \cdots + \alpha_k$) and $X_i$ given $X_1 = x_1, \ldots, X_{i-1} = x_{i-1}$ 
has the same distribution as $(1 - x_1 - \cdots - x_{i-1})U_i$, where 

$$U_i \sim \text{Beta}(\alpha_i, \alpha_{i+1} + \cdots + \alpha_k)$$

and $U_2, \ldots, U_{k-1}$ are independent. Note that $X_k = 1 - X_1 - \cdots - X_{k-1}$ for any 
Dirichlet distribution. So we generate $X_1 \sim$ Beta($\alpha_1, \alpha_2 + \cdots + \alpha_k$), generate $U_2 \sim$ 
Beta($\alpha_2, \alpha_3 + \cdots + \alpha_k$) and put $X_2 = (1 - X_1)U_2$, generate $U_3 \sim$ Beta($\alpha_3, \alpha_4 + \cdots + \alpha_k$) 
and put $X_3 = (1 - X_1 - X_2)U_3$, etc. 

Below, we present a table of a sample of n = 5 values from a Dirichlet(2, 3, 1, 1.5) 
distribution. 

0.116159   0.585788   0.229019   0.069034
0.166639   0.566369   0.056627   0.210366
0.411488   0.183686   0.326451   0.078375
0.483124   0.316647   0.115544   0.084684
0.117876   0.147869   0.418013   0.316242

Appendix B contains the code used for this. It can be modified to generate from any 
Dirichlet distribution. 
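
The sequential scheme of Example 10.2.3 can be sketched in a few lines of Python. This is an illustration only and is not the Appendix B code; the function name `rdirichlet` is hypothetical, and the beta generator is the standard library's `random.betavariate`.

```python
import random

def rdirichlet(alpha, rng=random):
    """Generate one Dirichlet(alpha_1, ..., alpha_k) vector via sequential betas."""
    k = len(alpha)
    x = []
    remaining = 1.0                      # 1 - x_1 - ... - x_{i-1}
    for i in range(k - 1):
        tail = sum(alpha[i + 1:])        # alpha_{i+1} + ... + alpha_k
        u = rng.betavariate(alpha[i], tail)
        xi = remaining * u               # X_i = (1 - X_1 - ... - X_{i-1}) U_i
        x.append(xi)
        remaining -= xi
    x.append(remaining)                  # X_k = 1 - X_1 - ... - X_{k-1}
    return x

random.seed(0)                           # arbitrary seed, for reproducibility only
sample = [rdirichlet([2, 3, 1, 1.5]) for _ in range(5)]
for row in sample:
    print(" ".join(f"{v:.6f}" for v in row))
```

The generated values will not match the table above, since they depend on the generator and seed, but each row should be nonnegative and sum to 1.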




Summary of Section 10.2 


• In this section, we have considered the situation in which we have a categorical 
response variable and a categorical predictor variable. 

• We distinguished two situations. The first arises when the value of the predictor 
variable is not assigned, and the second arises when it is. 

• In both cases, the test of the null hypothesis that no relationship exists involved 
the chi-squared test. 


EXERCISES 


10.2.1 The following table gives the counts of accidents for two successive years in a 
particular city. 

          June   July   August
Year 1     60    100      80
Year 2     80    100      60


Is there any evidence of a difference in the distribution of accidents for these months 
between the two years? 

10.2.2 The following data are from a study by Linus Pauling (1971) (“The significance 
of the evidence about ascorbic acid and the common cold,” Proceedings of the National 
Academy of Sciences, Vol. 68, p. 2678), concerned with examining the relationship 
between taking vitamin C and the incidence of colds. Of 279 participants in the study, 
140 received a placebo (sugar pill) and 139 received vitamin C. 

             No Cold   Cold
Placebo         31      109
Vitamin C       17      122


Assess the null hypothesis that there is no relationship between taking vitamin C and 
the incidence of the common cold. 

10.2.3 A simulation experiment is carried out to see whether there is any relationship 
between the first and second digits of a random variable generated from a Uniform[0, 1] 
distribution. A total of 1000 uniforms were generated; if the first and second digits were 
in {0, 1, 2, 3, 4} they were recorded as a 0, and as a 1 otherwise. The cross-classified 
data are given in the following table. 


                Second digit 0   Second digit 1
First digit 0        240              250
First digit 1        255              255


Assess the null hypothesis that there is no relationship between the digits. 
10.2.4 Grades in a first-year calculus course were obtained for randomly selected stu- 
dents at two universities and classified as pass or fail. The following data were obtained. 

               Fail   Pass
University 1    33    143
University 2    22    263




Is there any evidence of a relationship between calculus grades and university? 

10.2.5 The following data are recorded in Statistical Methods for Research Workers, 
by R. A. Fisher (Hafner Press, New York, 1922), and show the classifications of 3883 
Scottish children by gender (X) and hair color (Y). 


Y =fair Y=red Y=medium Y=dark Y =jet black 
592 119 849 


544 97 677 


(a) Is there any evidence for a relationship between hair color and gender? 

(b) Plot the appropriate bar chart(s). 

(c) Record the residuals and relate these to the results in parts (a) and (b). What do you 
conclude about the size of any deviation from independence? 

10.2.6 Suppose we have a controllable predictor X that takes four different values, and 
we measure a binary-valued response Y. A random sample of 100 was taken from the 
population and the value of X was randomly assigned to each individual in such a way 
that there are 25 sample members taking each of the possible values of X. Suppose that 
the following data were obtained. 


X=2 X=3 X=4 
16 14 


9 11 


(a) Assess whether or not there is any evidence against a cause-effect relationship 
existing between X and Y. 

(b) Explain why it is possible in this example to assert that any evidence found that a 
relationship exists is evidence that a cause-effect relationship exists. 

10.2.7 Write out in full how you would generate a value from a Dirichlet(1, 1, 1, 1) 
distribution. 

10.2.8 Suppose we have two categorical variables X and Y defined on a population $\Pi$ and we 
conduct a census. How would you decide whether or not a relationship exists between 
X and Y? If you decided that a relationship existed, how would you distinguish between 
a strong and a weak relationship? 

10.2.9 Suppose you simultaneously roll two dice n times and record the outcomes. 
Based on these values, how would you assess the null hypothesis that the outcome on 
each die is independent of the outcome on the other? 

10.2.10 Suppose a professor wants to assess whether or not there is any difference 
in the final grade distributions (A, B, C, D, and F) between males and females in a 
particular class. To assess the null hypothesis that there is no difference between these 
distributions, the professor carries out a chi-squared test. 

(a) Discuss how the professor carried out this test. 

(b) If the professor obtained evidence against the null hypothesis, discuss what con- 
cerns you have over the use of the chi-squared test. 

10.2.11 Suppose that a chi-squared test is carried out, based on a random sample of 
n from a population, to assess whether or not two categorical variables X and Y are 
independent. Suppose the P-value equals 0.001 and the investigator concludes that 
there is evidence against independence. Discuss how you would check to see if the 
deviation from independence was of practical significance. 


PROBLEMS 


10.2.12 In Example 10.2.1, place a uniform prior on the parameters (a Dirichlet distri- 
bution with all parameters equal to 1) and then determine the posterior distribution of 
the parameters. 
10.2.13 In Example 10.2.2, place a uniform prior on the parameters of each population 
(a Dirichlet distribution with all parameters equal to 1) and such that the three priors 
are independent. Then determine the posterior distribution. 
10.2.14 In a 2 × 2 table with probabilities $\theta_{ij}$, prove that the row and column variables 
are independent if and only if 

$$\frac{\theta_{11}\theta_{22}}{\theta_{12}\theta_{21}} = 1,$$

namely, we have independence if and only if the cross-ratio equals 1. 
10.2.15 Establish that the likelihood in (10.2.1) is correct when the population size is 
infinite (or when we are sampling with replacement from the population). 
10.2.16 (MV) Prove that the MLE of $(\theta_{11}, \ldots, \theta_{ab})$ in (10.2.1) is given by $\hat\theta_{ij} = 
f_{ij}/n$. Assume that $f_{ij} > 0$ for every i, j. (Hint: Use the facts that a continuous 
function on this parameter space $\Omega$ must achieve its maximum at some point in $\Omega$ 
and that, if the function is continuously differentiable at such a point, then all its first- 
order partial derivatives are zero there. This will allow you to conclude that the unique 
solution to the score equations must be the point where the log-likelihood is maximized. 
Try the case where a = 2, b = 2 first.) 
10.2.17 (MV) Prove that the MLE of $(\theta_{1\cdot}, \ldots, \theta_{a\cdot}, \theta_{\cdot 1}, \ldots, \theta_{\cdot b})$ in (10.2.2) is given 
by $\hat\theta_{i\cdot} = f_{i\cdot}/n$ and $\hat\theta_{\cdot j} = f_{\cdot j}/n$. Assume that $f_{i\cdot} > 0$, $f_{\cdot j} > 0$ for every i, j. (Hint: 
Use the hint in Problem 10.2.16.) 
10.2.18 (MV) Prove that the MLE of $(\theta_{1|X=1}, \ldots, \theta_{b|X=a})$ in (10.2.3) is given by 
$\hat\theta_{j|X=i} = f_{ij}/n_i$. Assume that $f_{ij} > 0$ for every i, j. (Hint: Use the hint in Problem 
10.2.16.) 
10.2.19 (MV) Prove that the MLE of $(\theta_1, \ldots, \theta_b)$ in (10.2.4) is given by $\hat\theta_j = f_{\cdot j}/n$. 
Assume that $f_{\cdot j} > 0$ for every j. (Hint: Use the hint in Problem 10.2.16.) 
10.2.20 Suppose that $X = (X_1, \ldots, X_{k-1}) \sim$ Dirichlet($\alpha_1, \ldots, \alpha_k$). Determine 
$E(X_1^{l_1} \cdots X_k^{l_k})$ in terms of the gamma function, when $l_i > 0$ for i = 1, ..., k. 


COMPUTER PROBLEMS 


10.2.21 Suppose that $(\theta_1, \theta_2, \theta_3, \theta_4) \sim$ Dirichlet(1, 1, 1, 1), as in Exercise 10.2.7. 
Generate a sample of size $N = 10^4$ from this distribution and use this to estimate the 
expectations of the $\theta_i$. Compare these estimates with their exact values. (Hint: There 
is some relevant code in Appendix B for the generation; see Appendix C for formulas 
for the exact values of these expectations.) 


538 Section 10.3: Quantitative Response and Predictors 


10.2.22 For Problem 10.2.12, generate a sample of size $N = 10^4$ from the posterior 
distribution of the parameters and use this to estimate the posterior expectations of the 
cell probabilities. Compare these estimates with their exact values. (Hint: There is 
some relevant code in Appendix B for the generation; see Appendix C for formulas for 
the exact values of these expectations.) 


CHALLENGES 


10.2.23 (MV) Establish the validity of the method discussed in Example 10.2.3 for 
generating from a Dirichlet($\alpha_1, \ldots, \alpha_k$) distribution. 


10.3 Quantitative Response and Predictors 


When the response and predictor variables are all categorical, it can be difficult to for- 
mulate simple models that adequately describe the relationship between the variables. 
We are left with recording the conditional distributions and plotting these in bar charts. 
When the response variable is quantitative, however, useful models have been formu- 
lated that give a precise mathematical expression for the form of the relationship that 
may exist. We will study these kinds of models in the next three sections. This section 
concentrates on the situation in which all the variables are quantitative. 


10.3.1 The Method of Least Squares 


The method of least squares is a general method for obtaining an estimate of a distribu- 
tion mean. It does not require specific distributional assumptions and so can be thought 
of as a distribution-free method (see Section 6.4). 

Suppose we have a random variable Y, and we want to estimate E(Y) based on a 
sample $(y_1, \ldots, y_n)$. The following principle is commonly used to generate estimates. 


The least-squares principle says that we select the point $t(y_1, \ldots, y_n)$, 
in the set of possible values for E(Y), that minimizes the sum of squared 
deviations (hence, “least squares”) given by $\sum_{i=1}^{n} (y_i - t(y_1, \ldots, y_n))^2$. 
Such an estimate is called a least-squares estimate. 


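Note that a least-squares estimate is defined for every sample size, even n = 1. 
To implement least squares, we must find the minimizing point $t(y_1, \ldots, y_n)$. Per- 
haps a first guess at this value is the sample average $\bar{y}$. Because 

$$\sum_{i=1}^{n} (y_i - \bar{y})(\bar{y} - t(y_1, \ldots, y_n)) = (\bar{y} - t(y_1, \ldots, y_n))\left(\sum_{i=1}^{n} y_i - n\bar{y}\right) = 0,$$

we have 

$$\begin{aligned}
\sum_{i=1}^{n} (y_i - t(y_1, \ldots, y_n))^2 &= \sum_{i=1}^{n} (y_i - \bar{y} + \bar{y} - t(y_1, \ldots, y_n))^2 \\
&= \sum_{i=1}^{n} (y_i - \bar{y})^2 + 2\sum_{i=1}^{n} (y_i - \bar{y})(\bar{y} - t(y_1, \ldots, y_n)) + \sum_{i=1}^{n} (\bar{y} - t(y_1, \ldots, y_n))^2 \\
&= \sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\bar{y} - t(y_1, \ldots, y_n))^2. \qquad (10.3.1)
\end{aligned}$$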




Therefore, the smallest possible value of (10.3.1) is $\sum_{i=1}^{n} (y_i - \bar{y})^2$, and this is assumed 
by taking $t(y_1, \ldots, y_n) = \bar{y}$. Note, however, that $\bar{y}$ might not be a possible value for 
E(Y) and that, in such a case, it will not be the least-squares estimate. In general, 
(10.3.1) says that the least-squares estimate is the value $t(y_1, \ldots, y_n)$ that is closest to 
$\bar{y}$ and is a possible value for E(Y). 

Consider the following example. 


EXAMPLE 10.3.1 
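Suppose that Y has one of the distributions on S = {0, 1} given in the following table. 

            y = 0   y = 1
p_1(y)       1/2     1/2
p_2(y)       1/3     2/3

Then the mean of Y is given by 

$$E_1(Y) = 0\left(\frac{1}{2}\right) + 1\left(\frac{1}{2}\right) = \frac{1}{2} \quad \text{or} \quad E_2(Y) = 0\left(\frac{1}{3}\right) + 1\left(\frac{2}{3}\right) = \frac{2}{3}.$$

Now suppose we observe the sample (0, 0, 1, 1, 1) and so $\bar{y} = 3/5$. Because the 
possible values for E(Y) are in {1/2, 2/3}, we see that t(0, 0, 1, 1, 1) = 2/3 because 
$(3/5 - 2/3)^2 = 0.004$ while $(3/5 - 1/2)^2 = 0.01$. 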
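
The selection in Example 10.3.1 can be reproduced directly from the least-squares principle. The Python sketch below is only illustrative (the function name `least_squares_estimate` is hypothetical); it minimizes the sum of squared deviations over the two candidate means and, as (10.3.1) suggests, confirms that this is the candidate closest to the sample average. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def least_squares_estimate(sample, possible_means):
    """Pick the candidate mean that minimizes the sum of squared deviations."""
    def sse(t):
        return sum((y - t) ** 2 for y in sample)
    return min(possible_means, key=sse)

sample = [0, 0, 1, 1, 1]
candidates = [Fraction(1, 2), Fraction(2, 3)]
estimate = least_squares_estimate(sample, candidates)

# By (10.3.1), the same choice results from taking the candidate closest to ybar.
ybar = Fraction(sum(sample), len(sample))
closest = min(candidates, key=lambda t: (ybar - t) ** 2)
print(estimate, ybar, closest)
```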


Whenever the set of possible values for E(Y) is an interval (a, b), however, and 
$P(Y \in (a, b)) = 1$, then $\bar{y} \in (a, b)$. This implies that $\bar{y}$ is the least-squares estimator 
of E(Y). So we see that in quite general circumstances, $\bar{y}$ is the least-squares estimate. 

There is an equivalence between least squares and the maximum likelihood method 
when we are dealing with normal distributions. 


EXAMPLE 10.3.2 Least Squares with Normal Distributions 
Suppose that $(y_1, \ldots, y_n)$ is a sample from an $N(\mu, \sigma_0^2)$ distribution, where $\mu$ is un- 
known. Then the MLE of $\mu$ is obtained by finding the value of $\mu$ that maximizes 

$$L(\mu \mid y_1, \ldots, y_n) = \exp\left\{-\frac{1}{2\sigma_0^2} \sum_{i=1}^{n} (y_i - \mu)^2\right\}.$$

Equivalently, the MLE maximizes the log-likelihood 

$$l(\mu \mid y_1, \ldots, y_n) = -\frac{1}{2\sigma_0^2} \sum_{i=1}^{n} (y_i - \mu)^2.$$

So we need to find the value of $\mu$ that minimizes $\sum_{i=1}^{n} (y_i - \mu)^2$, just as with least squares. 

In the case of the normal location model, we see that the least-squares estimate and 
the MLE of $\mu$ agree. This equivalence is true in general for normal models (e.g., the 
location-scale normal model), at least when we are considering estimates of location 
parameters. 


Some of the most important applications of least squares arise when we have that 
the response is a random vector $Y = (Y_1, \ldots, Y_n)' \in R^n$ (the prime ' indicates that 
we consider Y as a column), and we observe a single observation $y = (y_1, \ldots, y_n)' \in 
R^n$. The expected value of $Y \in R^n$ is defined to be the vector of expectations of its 
component random variables, namely, 

$$E(Y) = \begin{pmatrix} E(Y_1) \\ \vdots \\ E(Y_n) \end{pmatrix} \in R^n.$$

The least-squares principle then says that, based on the single observation $y = (y_1, \ldots, 
y_n)'$, we must find 

$$t(y) = t(y_1, \ldots, y_n) = (t_1(y_1, \ldots, y_n), \ldots, t_n(y_1, \ldots, y_n))',$$

in the set of possible values for E(Y) (a subset of $R^n$), that minimizes 

$$\sum_{i=1}^{n} (y_i - t_i(y_1, \ldots, y_n))^2. \tag{10.3.2}$$

So t(y) is the possible value for E(Y) that is closest to y, as the squared distance 
between two points $x, y \in R^n$ is given by $\sum_{i=1}^{n} (x_i - y_i)^2$. 

As is common in statistical applications, suppose that there are predictor variables 
that may be related to Y and whose values are observed. In this case, we will replace 
E(Y) by its conditional mean, given the observed values of the predictors. The least- 
squares estimate of the conditional mean is then the value $t(y_1, \ldots, y_n)$, in the set of 
possible values for the conditional mean of Y, that minimizes (10.3.2). We will use this 
definition in the following sections. 

Finding the minimizing value of t (y) in (10.3.2) can be a challenging optimization 
problem when the set of possible values for the mean is complicated. We will now 
apply least squares to some important problems where the least-squares solution can 
be found in closed form. 


10.3.2 The Simple Linear Regression Model 


Suppose we have a single quantitative response variable Y and a single quantitative 
predictor X, e.g., Y could be blood pressure measured in pounds per square inch and 
X could be age in years. To study the relationship between these variables, we examine 
the conditional distributions of Y, given X = x, to see how these change as we change 
x. 

We might choose to examine a particular characteristic of these distributions to see 
how it varies with x. Perhaps the most commonly used characteristic is the conditional 
mean of Y given X = x, or E (Y | X = x) (see Section 3.5). 

In the regression model (see Section 10.1), we assume that the conditional distrib- 
utions have constant shape and that they change, as we change x, at most through the 
conditional mean. In the simple linear regression model, we assume that the only way 
the conditional mean can change is via the relationship 


$$E(Y \mid X = x) = \beta_1 + \beta_2 x,$$

for some unknown values of $\beta_1 \in R^1$ (the intercept term) and $\beta_2 \in R^1$ (the slope 
coefficient). We also refer to $\beta_1$ and $\beta_2$ as the regression coefficients. 
Suppose we observe the independent values $(x_1, y_1), \ldots, (x_n, y_n)$ for (X, Y). Then, 
using the simple linear regression model, we have that 

$$E\left(\begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} \,\middle|\, X_1 = x_1, \ldots, X_n = x_n\right) = \begin{pmatrix} \beta_1 + \beta_2 x_1 \\ \vdots \\ \beta_1 + \beta_2 x_n \end{pmatrix}. \tag{10.3.3}$$

Equation (10.3.3) tells us that the conditional expected value of the response 
$(Y_1, \ldots, Y_n)'$ is in a particular subset of $R^n$. Furthermore, (10.3.2) becomes 

$$\sum_{i=1}^{n} (y_i - t_i(y))^2 = \sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_i)^2, \tag{10.3.4}$$

and we must find the values of $\beta_1$ and $\beta_2$ that minimize (10.3.4). These values are 
called the least-squares estimates of $\beta_1$ and $\beta_2$. 
Before we show how to do this, consider an example. 


EXAMPLE 10.3.3 
Suppose we obtained the following n = 10 data points $(x_i, y_i)$. 

(3.9, 8.9)    (2.6, 7.1)   (2.4, 4.6)      (4.1, 10.7)    (-0.2, 1.0)
(5.4, 12.6)   (0.6, 3.3)   (-5.6, -10.4)   (-1.1, -2.3)   (-2.1, -1.6)

In Figure 10.3.1, we have plotted these points together with the line y = 1 + x. 


[Scatter plot of the data points together with the line y = 1 + x; horizontal axis x.] 

Figure 10.3.1: A plot of the data points $(x_i, y_i)$ (+) and the line y = 1 + x in Example 10.3.3. 


Notice that with $\beta_1 = 1$ and $\beta_2 = 1$, then 

$$(y_i - \beta_1 - \beta_2 x_i)^2 = (y_i - 1 - x_i)^2$$

is the squared vertical distance between the point $(x_i, y_i)$ and the point on the line with 
the same x value. So (10.3.4) is the sum of these squared deviations and in this case 
equals 

$$(8.9 - 1 - 3.9)^2 + (7.1 - 1 - 2.6)^2 + \cdots + (-1.6 - 1 + 2.1)^2 = 141.15.$$

If $\beta_1 = 1$ and $\beta_2 = 1$ were the least-squares estimates, then 141.15 would be equal to 
the smallest possible value of (10.3.4). In this case, it turns out (see Example 10.3.4) 
that the least-squares estimates are given by the values $b_1 = 1.33$, $b_2 = 2.06$, and the 
minimized value of (10.3.4) is given by 8.46, which is much smaller than 141.15. 

So we see that, in finding the least-squares estimates, we are in essence finding 
the line $\beta_1 + \beta_2 x$ that best fits the data, in the sense that the sum of squared vertical 
deviations of the observed points to the line is minimized. 


Scatter Plots 


As part of Example 10.3.3, we plotted the points $(x_1, y_1), \ldots, (x_n, y_n)$ in a graph. This 
is called a scatter plot, and it is a recommended first step as part of any analysis of the 
relationship between quantitative variables X and Y. A scatter plot can give us a very 
general idea of whether or not a relationship exists and what form it might take. 

It is important to remember, however, that the appearance of such a plot is highly 
dependent on the scales we choose for the axes. For example, we can make a scatter 
plot look virtually flat (and so indicate that no relationship exists) by choosing to place 
too wide a range of tick marks on the y-axis. So we must always augment a scatter plot 
with a statistical analysis based on numbers. 


Least-Squares Estimates, Predictions, and Standard Errors 


For the simple linear regression model, we can work out exact formulas for the
least-squares estimates of $\beta_1$ and $\beta_2$.


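Theorem 10.3.1 Suppose that $E(Y \mid X = x) = \beta_1 + \beta_2 x$, and we observe the
independent values $(x_1, y_1), \ldots, (x_n, y_n)$ for $(X, Y)$. Then the least-squares estimates
of $\beta_1$ and $\beta_2$ are given by

\[ b_1 = \bar{y} - b_2 \bar{x} \quad \text{and} \quad b_2 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \]

respectively, whenever $\sum_{i=1}^n (x_i - \bar{x})^2 \neq 0$.

PROOF The proof of this result can be found in Section 10.6. ■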


We call the line $y = b_1 + b_2 x$ the least-squares line, or best-fitting line, and $b_1 + b_2 x$
is the least-squares estimate of $E(Y \mid X = x)$. Note that $\sum_{i=1}^n (x_i - \bar{x})^2 = 0$ if and
only if $x_1 = \cdots = x_n$. In such a case we cannot use least squares to estimate $\beta_1$ and
$\beta_2$, although we can still estimate $E(Y \mid X = x)$ (see Problem 10.3.19).
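The formulas of Theorem 10.3.1 translate directly into a few lines of code. The sketch below is illustrative only: the function name `least_squares_line` is hypothetical, the data are those of Example 10.3.3 with the last point taken as (-2.1, -1.6), and the guard for equal x values is one simple way to reflect the condition $\sum_{i=1}^n (x_i - \bar{x})^2 \neq 0$.

```python
# Data of Example 10.3.3 (last point assumed to be (-2.1, -1.6)).
xs = [3.9, 2.6, 2.4, 4.1, -0.2, 5.4, 0.6, -5.6, -1.1, -2.1]
ys = [8.9, 7.1, 4.6, 10.7, 1.0, 12.6, 3.3, -10.4, -2.3, -1.6]

def least_squares_line(xs, ys):
    """Return (b1, b2), the intercept and slope given by Theorem 10.3.1."""
    if len(set(xs)) < 2:
        # All x values equal: the sum of squares of the x deviations is 0.
        raise ValueError("the x values do not vary; the slope is not determined")
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b2 = sxy / sxx
    b1 = ybar - b2 * xbar
    return b1, b2

b1, b2 = least_squares_line(xs, ys)
print(round(b1, 2), round(b2, 2))
```

With these data the rounded values agree with the estimates 1.33 and 2.06 reported for Example 10.3.3.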




Now that we have estimates $b_1$, $b_2$ of the regression coefficients, we want to use
these for inferences about $\beta_1$ and $\beta_2$. These estimates have the unbiasedness property.


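Theorem 10.3.2 If $E(Y \mid X = x) = \beta_1 + \beta_2 x$, and we observe the independent
values $(x_1, y_1), \ldots, (x_n, y_n)$ for $(X, Y)$, then

(i) $E(B_1 \mid X_1 = x_1, \ldots, X_n = x_n) = \beta_1$,
(ii) $E(B_2 \mid X_1 = x_1, \ldots, X_n = x_n) = \beta_2$.

PROOF The proof of this result can be found in Section 10.6. ■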


Note that Theorem 10.3.2 and the theorem of total expectation imply that $E(B_1) = \beta_1$
and $E(B_2) = \beta_2$ unconditionally as well.

Adding the assumption that the conditional variances exist, we have the following 
theorem. 


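Theorem 10.3.3 If $E(Y \mid X = x) = \beta_1 + \beta_2 x$, $\mathrm{Var}(Y \mid X = x) = \sigma^2$ for every $x$,
and we observe the independent values $(x_1, y_1), \ldots, (x_n, y_n)$ for $(X, Y)$, then

(i) $\mathrm{Var}(B_1 \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 \left( 1/n + \bar{x}^2 / \sum_{i=1}^n (x_i - \bar{x})^2 \right)$,
(ii) $\mathrm{Var}(B_2 \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 / \sum_{i=1}^n (x_i - \bar{x})^2$,
(iii) $\mathrm{Cov}(B_1, B_2 \mid X_1 = x_1, \ldots, X_n = x_n) = -\sigma^2 \bar{x} / \sum_{i=1}^n (x_i - \bar{x})^2$.

PROOF See Section 10.6 for the proof of this result. ■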


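For the least-squares estimate $b_1 + b_2 x$ of the mean $E(Y \mid X = x) = \beta_1 + \beta_2 x$, we
have the following result.

Corollary 10.3.1

\[ \mathrm{Var}(B_1 + B_2 x \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right) \tag{10.3.5} \]

PROOF See Section 10.6 for the proof of this result. ■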


A natural predictor of a future value of Y, when $X = x$, is given by the conditional
mean $E(Y \mid X = x) = \beta_1 + \beta_2 x$. Because we do not know the values of $\beta_1$ and $\beta_2$, we
use the estimated mean $b_1 + b_2 x$ as the predictor.

When we are predicting Y at an x value that lies within the range of the observed 
values of X, we refer to this as an interpolation. When we want to predict at an x 
value that lies outside this range, we refer to this as an extrapolation. Extrapolations 
are much less reliable than interpolations. The farther away x is from the observed 
range of X values, then, intuitively, the less reliable we feel such a prediction will be. 
Such considerations should always be borne in mind. From (10.3.5), we see that the
variance of the prediction at the value $X = x$ increases as $x$ moves away from $\bar{x}$. So to
a certain extent, the standard error does reflect this increased uncertainty, but note that 
its form is based on the assumption that the simple linear regression model is correct. 




Even if we accept the simple linear regression model based on the observed data (we 
will discuss model checking later in this section), this model may fail to apply for very 
different values of x, and so the predictions would be in error. 

We want to use the results of Theorem 10.3.3 and Corollary 10.3.1 to calculate
standard errors of the least-squares estimates. Because we do not know $\sigma^2$, however,
we need an estimate of this quantity as well. The following result shows that

\[ s^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - b_1 - b_2 x_i)^2 \tag{10.3.6} \]

is an unbiased estimate of $\sigma^2$.


Theorem 10.3.4 If $E(Y \mid X = x) = \beta_1 + \beta_2 x$, $\mathrm{Var}(Y \mid X = x) = \sigma^2$ for every $x$,
and we observe the independent values $(x_1, y_1), \ldots, (x_n, y_n)$ for $(X, Y)$, then

\[ E(S^2 \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2. \]

PROOF See Section 10.6 for the proof of this result. ■


Therefore, the standard error of $b_1$ is then given by

\[ s \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)^{1/2}, \]

and the standard error of $b_2$ is then given by

\[ s \left( \sum_{i=1}^n (x_i - \bar{x})^2 \right)^{-1/2}. \]

Under further assumptions, these standard errors can be interpreted just as we interpreted
standard errors of estimates of the mean in the location and location-scale normal models.


EXAMPLE 10.3.4 (Example 10.3.3 continued) 
Using the data in Example 10.3.3 and the formulas of Theorem 10.3.1, we obtain $b_1 = 1.33$,
$b_2 = 2.06$ as the least-squares estimates of the intercept and slope, respectively.
So the least-squares line is given by $1.33 + 2.06x$. Using (10.3.6), we obtain $s^2 = 1.06$
as the estimate of $\sigma^2$.

Using the formulas of Theorem 10.3.3, the standard error of $b_1$ is 0.3408, while the
standard error of $b_2$ is 0.1023.

The prediction of Y at $X = 2.0$ is given by $1.33 + 2.06(2) = 5.45$. Using Corollary
10.3.1, this estimate has standard error 0.341. This prediction is an interpolation. ■
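The quantities in Example 10.3.4 can be reproduced with a short computation. This is a sketch under stated assumptions: the data are those of Example 10.3.3 with the last point taken as (-2.1, -1.6), and the helper names `predict` and `se_mean` are hypothetical.

```python
import math

# Data of Example 10.3.3 (last point assumed to be (-2.1, -1.6)).
xs = [3.9, 2.6, 2.4, 4.1, -0.2, 5.4, 0.6, -5.6, -1.1, -2.1]
ys = [8.9, 7.1, 4.6, 10.7, 1.0, 12.6, 3.3, -10.4, -2.3, -1.6]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

# Least-squares estimates (Theorem 10.3.1).
b2 = sxy / sxx
b1 = ybar - b2 * xbar

# Unbiased estimate of the error variance, as in (10.3.6).
s2 = sum((y - b1 - b2 * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
s = math.sqrt(s2)

# Standard errors: Theorem 10.3.3 with s in place of sigma.
se_b1 = s * math.sqrt(1 / n + xbar ** 2 / sxx)
se_b2 = s / math.sqrt(sxx)

def predict(x):
    """Estimated mean response b1 + b2*x."""
    return b1 + b2 * x

def se_mean(x):
    """Standard error of b1 + b2*x, from (10.3.5) with s in place of sigma."""
    return s * math.sqrt(1 / n + (x - xbar) ** 2 / sxx)

print(round(s2, 2), round(se_b1, 4), round(se_b2, 4))
print(round(predict(2.0), 2), round(se_mean(2.0), 3))
```

Because the computation carries unrounded intermediate values, the last digit can differ slightly from figures in the text that were obtained from rounded inputs.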


The ANOVA Decomposition and the F-Statistic 


The following result gives a decomposition of the total sum of squares $\sum_{i=1}^n (y_i - \bar{y})^2$.

Lemma 10.3.1 If $(x_1, y_1), \ldots, (x_n, y_n)$ are such that $\sum_{i=1}^n (x_i - \bar{x})^2 \neq 0$, then

\[ \sum_{i=1}^n (y_i - \bar{y})^2 = b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2 + \sum_{i=1}^n (y_i - b_1 - b_2 x_i)^2. \]

PROOF The proof of this result can be found in Section 10.6. ■


We refer to

\[ b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2 \]

as the regression sum of squares (RSS) and refer to

\[ \sum_{i=1}^n (y_i - b_1 - b_2 x_i)^2 \]

as the error sum of squares (ESS).

If we think of the total sum of squares as measuring the total observed variation in 
the response values y;, then Lemma 10.3.1 provides a decomposition of this variation 
into the RSS, measuring changes in the response due to changes in the predictor, and 
the ESS, measuring changes in the response due to the contribution of random error. 

It is common to write this decomposition in an analysis of variance table (ANOVA). 


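Source   Df      Sum of Squares                              Mean Square
X        1       $b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2$      $b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2$
Error    n - 2   $\sum_{i=1}^n (y_i - b_1 - b_2 x_i)^2$      $s^2$
Total    n - 1   $\sum_{i=1}^n (y_i - \bar{y})^2$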


Here, Df stands for degrees of freedom (we will discuss how the Df entries are cal- 
culated in Section 10.3.4). The entries in the Mean Square column are calculated by 
dividing the corresponding sum of squares by the Df entry. 

To see the significance of the ANOVA table, note that, from Theorem 10.3.3,

\[ E\left( B_2^2 \sum_{i=1}^n (x_i - \bar{x})^2 \,\Big|\, X_1 = x_1, \ldots, X_n = x_n \right) = \sigma^2 + \beta_2^2 \sum_{i=1}^n (x_i - \bar{x})^2, \tag{10.3.7} \]

which is equal to $\sigma^2$ if and only if $\beta_2 = 0$ (we are always assuming here that the $x_i$
vary). Given that the simple linear regression model is correct, we have that $\beta_2 = 0$ if
and only if there is no relationship between the response and the predictor. Therefore,
$b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2$ is an unbiased estimator of $\sigma^2$ if and only if $\beta_2 = 0$. Because $s^2$
is always an unbiased estimate of $\sigma^2$ (Theorem 10.3.4), a sensible statistic to use in
assessing $H_0 : \beta_2 = 0$ is given by

\[ F = \frac{\mathrm{RSS}}{\mathrm{ESS}/(n-2)} = \frac{b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2}{s^2}, \tag{10.3.8} \]

as this is the ratio of two unbiased estimators of $\sigma^2$ when $H_0$ is true. We then conclude
that we have evidence against $H_0$ when F is large, as (10.3.7) also shows that the
numerator will tend to be larger than $\sigma^2$ when $H_0$ is false. We refer to (10.3.8) as the
F-statistic. We will subsequently discuss the sampling distribution of F to see how to
determine when the value F is so large as to be evidence against $H_0$.


EXAMPLE 10.3.5 (Example 10.3.3 continued) 
Using the data of Example 10.3.3, we obtain

\[ \sum_{i=1}^n (y_i - \bar{y})^2 = 437.01, \]

\[ b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2 = 428.55, \]

\[ \sum_{i=1}^n (y_i - b_1 - b_2 x_i)^2 = 437.01 - 428.55 = 8.46, \]

and so

\[ F = \frac{b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2}{s^2} = \frac{428.55}{1.06} = 404.29. \]

Note that F is much bigger than 1, and this seems to indicate a linear effect due to X. ■
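The ANOVA decomposition of Lemma 10.3.1 and the F-statistic can be verified numerically. The following sketch assumes the data of Example 10.3.3 with the last point taken as (-2.1, -1.6); the variable names `total_ss`, `rss`, and `ess` are illustrative. Note that the value 404.29 in the example uses $s^2$ rounded to 1.06, so an unrounded computation gives a slightly larger F (about 405).

```python
# Data of Example 10.3.3 (last point assumed to be (-2.1, -1.6)).
xs = [3.9, 2.6, 2.4, 4.1, -0.2, 5.4, 0.6, -5.6, -1.1, -2.1]
ys = [8.9, 7.1, 4.6, 10.7, 1.0, 12.6, 3.3, -10.4, -2.3, -1.6]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b2 = sxy / sxx
b1 = ybar - b2 * xbar

total_ss = sum((y - ybar) ** 2 for y in ys)                    # total sum of squares
rss = b2 ** 2 * sxx                                            # regression sum of squares
ess = sum((y - b1 - b2 * x) ** 2 for x, y in zip(xs, ys))      # error sum of squares

s2 = ess / (n - 2)        # mean square for error
f_stat = rss / s2         # F-statistic (10.3.8), unrounded

print(round(total_ss, 2), round(rss, 2), round(ess, 2), round(f_stat, 1))
```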


The Coefficient of Determination and Correlation 


Lemma 10.3.1 implies that

\[ R^2 = \frac{b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \]

satisfies $0 \leq R^2 \leq 1$. Therefore, the closer $R^2$ is to 1, the more of the observed total
variation in the response is accounted for by changes in the predictor. In fact, we
interpret $R^2$, called the coefficient of determination, as the proportion of the observed
variation in the response explained by changes in the predictor via the simple linear
regression.

The coefficient of determination is an important descriptive statistic, for, even if 
we conclude that a relationship does exist, it can happen that most of the observed 
variation is due to error. If we want to use the model to predict further values of the 
response, then the coefficient of determination tells us whether we can expect highly 
accurate predictions or not. A value of $R^2$ near 1 means highly accurate predictions,
whereas a value near 0 means that predictions will not be very accurate. 


EXAMPLE 10.3.6 (Example 10.3.3 continued) 

Using the data of Example 10.3.3, we obtain $R^2 = 0.981$. Therefore, 98.1% of the
observed variation in Y can be explained by the changes in X through the linear relation.
This indicates that we can expect fairly accurate predictions when using this model, at
least when we are predicting within the range of the observed X values. ■




Recall that in Section 3.3, we defined the correlation coefficient between random
variables X and Y to be

\[ \rho_{XY} = \mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Sd}(X)\,\mathrm{Sd}(Y)}. \]

In Corollary 3.6.1, we proved that $-1 \leq \rho_{XY} \leq 1$ with $\rho_{XY} = \pm 1$ if and only if
$Y = a \pm cX$ for some constants $a \in R^1$ and $c > 0$. So $\rho_{XY}$ can be taken as a measure
of the extent to which a linear relationship exists between X and Y.

If we do not know the joint distribution of (X, Y), then we will have to estimate
$\rho_{XY}$. Based on the observations $(x_1, y_1), \ldots, (x_n, y_n)$, the natural estimate to use is
the sample correlation coefficient

\[ r_{xy} = \frac{s_{xy}}{s_x s_y}, \]

where

\[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \]

is the sample covariance estimating Cov(X, Y), and $s_x$, $s_y$ are the sample standard
deviations for the X and Y variables, respectively. Then $-1 \leq r_{xy} \leq 1$ with $r_{xy} = \pm 1$
if and only if $y_i = a \pm c x_i$ for some constants $a \in R^1$ and $c > 0$, for every $i$ (the proof
is the same as in Corollary 3.6.1 using the joint distribution that puts probability mass
$1/n$ at each point $(x_i, y_i)$; see Problem 3.6.16).

The following result shows that the coefficient of determination is the square of the 
correlation between the observed X and Y values. 


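Theorem 10.3.5 If $(x_1, y_1), \ldots, (x_n, y_n)$ are such that $\sum_{i=1}^n (x_i - \bar{x})^2 \neq 0$ and
$\sum_{i=1}^n (y_i - \bar{y})^2 \neq 0$, then $R^2 = r_{xy}^2$.

PROOF We have

\[ R^2 = \frac{b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = \frac{\left( \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2} = \frac{s_{xy}^2}{s_x^2 s_y^2} = r_{xy}^2, \]

where we have used the formula for $b_2$ given in Theorem 10.3.1. ■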
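The identity $R^2 = r_{xy}^2$ is easy to confirm numerically. The sketch below is illustrative: it uses the data of Example 10.3.3 with the last point taken as (-2.1, -1.6), and it assumes the usual divisor $n - 1$ in the sample covariance and sample variances (the divisor cancels in $r_{xy}$ in any case).

```python
import math

# Data of Example 10.3.3 (last point assumed to be (-2.1, -1.6)).
xs = [3.9, 2.6, 2.4, 4.1, -0.2, 5.4, 0.6, -5.6, -1.1, -2.1]
ys = [8.9, 7.1, 4.6, 10.7, 1.0, 12.6, 3.3, -10.4, -2.3, -1.6]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
syy = sum((y - ybar) ** 2 for y in ys)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

# Coefficient of determination: regression sum of squares over total sum of squares.
b2 = sxy / sxx
r_squared = b2 ** 2 * sxx / syy

# Sample correlation coefficient from the sample covariance and standard deviations.
s_xy = sxy / (n - 1)
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
r_xy = s_xy / (s_x * s_y)

print(round(r_squared, 3), round(r_xy ** 2, 3))
```

Both printed values agree with the $R^2 = 0.981$ reported in Example 10.3.6.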


Confidence Intervals and Testing Hypotheses 


We need to make some further assumptions in order to discuss the sampling distribu- 
tions of the various statistics that we have introduced. We have the following results. 




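Theorem 10.3.6 If Y, given $X = x$, is distributed $N(\beta_1 + \beta_2 x, \sigma^2)$, and we observe
the independent values $(x_1, y_1), \ldots, (x_n, y_n)$ for $(X, Y)$, then the conditional
distributions of $B_1$, $B_2$, and $S^2$, given $X_1 = x_1, \ldots, X_n = x_n$, are as follows.

(i) $B_1 \sim N\left( \beta_1, \sigma^2 \left( 1/n + \bar{x}^2 / \sum_{i=1}^n (x_i - \bar{x})^2 \right) \right)$
(ii) $B_2 \sim N\left( \beta_2, \sigma^2 / \sum_{i=1}^n (x_i - \bar{x})^2 \right)$
(iii) $B_1 + B_2 x \sim N\left( \beta_1 + \beta_2 x, \sigma^2 \left( 1/n + (x - \bar{x})^2 / \sum_{i=1}^n (x_i - \bar{x})^2 \right) \right)$
(iv) $(n - 2) S^2 / \sigma^2 \sim \chi^2(n - 2)$ independent of $(B_1, B_2)$

PROOF The proof of this result can be found in Section 10.6. ■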


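Corollary 10.3.2

(i) $(B_1 - \beta_1) \big/ \left( S \left( 1/n + \bar{x}^2 / \sum_{i=1}^n (x_i - \bar{x})^2 \right)^{1/2} \right) \sim t(n - 2)$
(ii) $(B_2 - \beta_2) \left( \sum_{i=1}^n (x_i - \bar{x})^2 \right)^{1/2} \big/ S \sim t(n - 2)$
(iii)

\[ \frac{B_1 + B_2 x - \beta_1 - \beta_2 x}{S \left( 1/n + (x - \bar{x})^2 / \sum_{i=1}^n (x_i - \bar{x})^2 \right)^{1/2}} \sim t(n - 2) \]

(iv) If F is defined as in (10.3.8), then $H_0 : \beta_2 = 0$ is true if and only if
$F \sim F(1, n - 2)$.

PROOF The proof of this result can be found in Section 10.6. ■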
Using Corollary 10.3.2(i), we have that

\[ b_1 \pm s \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)^{1/2} t_{(1+\gamma)/2}(n - 2) \]

is an exact $\gamma$-confidence interval for $\beta_1$. Also, from Corollary 10.3.2(ii),

\[ b_2 \pm s \left( \sum_{i=1}^n (x_i - \bar{x})^2 \right)^{-1/2} t_{(1+\gamma)/2}(n - 2) \]

is an exact $\gamma$-confidence interval for $\beta_2$.

From Corollary 10.3.2(iv), we can test $H_0 : \beta_2 = 0$ by computing the P-value

\[ P\left( F \geq \frac{b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2}{s^2} \right), \tag{10.3.9} \]

where $F \sim F(1, n - 2)$, to see whether or not the observed value (10.3.8) is surprising.
This is sometimes called the ANOVA test. Note that Corollary 10.3.2(ii) implies that
we can also test $H_0 : \beta_2 = 0$ by computing the P-value

\[ P\left( |T| \geq \frac{|b_2| \left( \sum_{i=1}^n (x_i - \bar{x})^2 \right)^{1/2}}{s} \right), \tag{10.3.10} \]

where $T \sim t(n - 2)$. The proof of Corollary 10.3.2(iv) reveals that (10.3.9) and
(10.3.10) are equal.


EXAMPLE 10.3.7 (Example 10.3.3 continued) 
Using software or Table D.4, we obtain $t_{0.975}(8) = 2.306$. Then, using the data of
Example 10.3.3, we obtain a 0.95-confidence interval for $\beta_1$ as

\[ b_1 \pm s \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \right)^{1/2} t_{(1+\gamma)/2}(n - 2) = 1.33 \pm (0.3408)(2.306) = [0.544, 2.116] \]

and a 0.95-confidence interval for $\beta_2$ as

\[ b_2 \pm s \left( \sum_{i=1}^n (x_i - \bar{x})^2 \right)^{-1/2} t_{(1+\gamma)/2}(n - 2) = 2.06 \pm (0.1023)(2.306) = [1.824, 2.296]. \]

The 0.95-confidence interval for $\beta_2$ does not include 0, so we have evidence against
the null hypothesis $H_0 : \beta_2 = 0$ and conclude that there is evidence of a relationship
between X and Y. This is confirmed by the F-test of this null hypothesis, as it gives the
P-value $P(F > 404.29) = 0.000$ when $F \sim F(1, 8)$.
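The two confidence intervals follow mechanically from the estimates, their standard errors, and the t quantile. The sketch below assumes the data of Example 10.3.3 with the last point taken as (-2.1, -1.6) and takes the quantile $t_{0.975}(8) = 2.306$ from the text rather than computing it; the function name `t_interval` is hypothetical. Endpoints can differ in the third decimal from those above, which were computed from rounded inputs.

```python
import math

# Data of Example 10.3.3 (last point assumed to be (-2.1, -1.6)).
xs = [3.9, 2.6, 2.4, 4.1, -0.2, 5.4, 0.6, -5.6, -1.1, -2.1]
ys = [8.9, 7.1, 4.6, 10.7, 1.0, 12.6, 3.3, -10.4, -2.3, -1.6]

T_QUANTILE = 2.306  # t_{0.975}(8), quoted in the text (not computed here)

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b2 = sxy / sxx
b1 = ybar - b2 * xbar
s = math.sqrt(sum((y - b1 - b2 * x) ** 2 for x, y in zip(xs, ys)) / (n - 2))

def t_interval(estimate, std_error, quantile):
    """Interval estimate +/- std_error * quantile, returned as (lower, upper)."""
    half_width = std_error * quantile
    return estimate - half_width, estimate + half_width

ci_b1 = t_interval(b1, s * math.sqrt(1 / n + xbar ** 2 / sxx), T_QUANTILE)
ci_b2 = t_interval(b2, s / math.sqrt(sxx), T_QUANTILE)

print([round(v, 3) for v in ci_b1])
print([round(v, 3) for v in ci_b2])
```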


Analysis of Residuals 


In an application of the simple regression model, we must check to make sure that
the assumptions make sense in light of the data we have collected. Model checking is
based on the residuals $y_i - b_1 - b_2 x_i$ (after standardization), as discussed in Section
9.1. Note that the ith residual is just the difference between the observed value $y_i$ at $x_i$
and the predicted value $b_1 + b_2 x_i$ at $x_i$.

From the proof of Theorem 10.3.4, we have the following result. 


Corollary 10.3.3

(i) $E(Y_i - B_1 - B_2 x_i \mid X_1 = x_1, \ldots, X_n = x_n) = 0$

(ii) $\mathrm{Var}(Y_i - B_1 - B_2 x_i \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 \left( 1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2} \right)$

This leads to the definition of the ith standardized residual as

\[ \frac{y_i - b_1 - b_2 x_i}{s \left( 1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2} \right)^{1/2}}. \tag{10.3.11} \]

Corollary 10.3.3 says that (10.3.11), with $\sigma$ replacing $s$, is a value from a distribution
with conditional mean 0 and conditional variance 1. Furthermore, when the
conditional distribution of the response given the predictors is normal, then the
conditional distribution of this quantity is N(0, 1) (see Problem 10.3.21). These results
are approximately true for (10.3.11) for large $n$. Furthermore, it can be shown (see
Problem 10.3.20) that

\[ \mathrm{Cov}(Y_i - B_1 - B_2 x_i, \; Y_j - B_1 - B_2 x_j \mid X_1 = x_1, \ldots, X_n = x_n) = -\sigma^2 \left( \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{\sum_{k=1}^n (x_k - \bar{x})^2} \right). \]

Therefore, under the normality assumption, the residuals are approximately independent
when $n$ is large and

\[ \frac{x_i - \bar{x}}{\sqrt{\sum_{k=1}^n (x_k - \bar{x})^2}} \to 0 \]

as $n \to \infty$. This will be the case whenever Var(X) is finite (see Challenge 10.3.27)
or, in the design context, when the values of the predictor are chosen accordingly. So
one approach to model checking here is to see whether the values given by (10.3.11)
look at all like a sample from the N(0, 1) distribution. For this, we can use the plots
discussed in Chapter 9.


EXAMPLE 10.3.8 (Example 10.3.3 continued) 
Using the data of Example 10.3.3, we obtain the following standardized residuals.

-0.49643   0.43212   -1.73371    1.00487   0.08358
 0.17348   0.75281   -0.28430   -1.43570   1.51027


These are plotted against the predictor x in Figure 10.3.2. 


Figure 10.3.2: Plot of the standardized residuals in Example 10.3.8.


It is recommended that we plot the standardized residuals against the predictor, as 
this may reveal some underlying relationship that has not been captured by the model. 
This residual plot looks reasonable. In Figure 10.3.3, we have a normal probability plot 
of the standardized residuals. These points lie close to the line through the origin with 
slope equal to 1, so we conclude that we have no evidence against the model here. ■


Figure 10.3.3: Normal probability plot of the standardized residuals in Example 10.3.8.
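The standardized residuals of Example 10.3.8 can be recomputed from definition (10.3.11). The following is a minimal sketch, assuming the data of Example 10.3.3 with the last point taken as (-2.1, -1.6); the function name `standardized_residuals` is hypothetical.

```python
import math

# Data of Example 10.3.3 (last point assumed to be (-2.1, -1.6)).
xs = [3.9, 2.6, 2.4, 4.1, -0.2, 5.4, 0.6, -5.6, -1.1, -2.1]
ys = [8.9, 7.1, 4.6, 10.7, 1.0, 12.6, 3.3, -10.4, -2.3, -1.6]

def standardized_residuals(xs, ys):
    """Standardized residuals of the least-squares line, following (10.3.11)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b2 = sxy / sxx
    b1 = ybar - b2 * xbar
    residuals = [y - b1 - b2 * x for x, y in zip(xs, ys)]
    s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    # Each residual is divided by its estimated conditional standard deviation.
    return [
        r / (s * math.sqrt(1 - 1 / n - (x - xbar) ** 2 / sxx))
        for r, x in zip(residuals, xs)
    ]

std_res = standardized_residuals(xs, ys)
print([round(r, 5) for r in std_res])
```

Listing the values in the order of the data makes it easy to compare them with the table in the example and to plot them against the predictor.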


What do we do if model checking leads to a failure of the model? As discussed 
in Chapter 9, perhaps the most common approach is to consider making various trans- 
formations of the data to see whether there is a simple modification of the model that 
will pass. We can make transformations, not only to the response variable Y, but to the 
predictor variable X as well. 


An Application of Simple Linear Regression Analysis 


The following data set is taken from Statistical Methods, 6th ed., by G. Snedecor and 
W. Cochran (Iowa State University Press, Ames, 1967) and gives the record speed Y 
in miles per hour at the Indianapolis Memorial Day car races in the years 1911-1941, 
excepting the years 1917—1918. We have coded the year X starting at 0 in 1911 and 
incrementing by 1 for each year. There are n = 29 data points (x;, y;). The goal of 
the analysis is to obtain the least-squares line and, if warranted, make inferences about 
the regression coefficients. We take the normal simple linear regression model as our 
statistical model. Note that this is an observational study. 


Table: coded Year and record Speed for the 29 races (values not reproduced here).


Using Theorem 10.3.1, we obtain the least-squares line as $y = 77.5681 + 1.27793x$.
This line, together with a scatter plot of the values $(x_i, y_i)$, is plotted in Figure 10.3.4.
The fit looks quite good, but this is no guarantee of model correctness, and we must
carry out some form of model checking.

Figure 10.3.5 is a plot of the standardized residuals against the predictor. This plot 
looks reasonable, with no particularly unusual pattern apparent. Figure 10.3.6 is a nor- 
mal probability plot of the standardized residuals. The curvature in the center might 
give rise to some doubt about the normality assumption. We generated a few samples
of $n = 29$ from an N(0, 1) distribution, however, and looking at the normal probability
plots (always recommended) reveals that this is not much cause for concern. Of
course, we should also carry out model checking procedures based upon the standard- 
ized residuals and using P-values, but we do not pursue this topic further here. 


Figure 10.3.4: A scatter plot of the data together with a plot of the least-squares line. (Regression plot: Speed = 77.5681 + 1.27793 Year; S = 2.99865, R-Sq = 94.0%, R-Sq(adj) = 93.8%.)


Figure 10.3.5: A plot of the standardized residuals against the predictor.

Figure 10.3.6: A normal probability plot of the standardized residuals.


Based on the results of our model checking, we decide to proceed to inferences
about the regression coefficients. The estimates and their standard errors are given in
the following table, where we have used the estimate of $\sigma^2$ given by $s^2 = (2.999)^2$
to compute the standard errors. We have also recorded the t-statistics appropriate for
testing each of the hypotheses $H_0 : \beta_1 = 0$ and $H_0 : \beta_2 = 0$.

Coefficient   Estimate   Standard Error   t-statistic
$\beta_1$     77.568     1.118            69.39
$\beta_2$     1.278      0.062            20.55

From this, we see that the P-value for assessing $H_0 : \beta_2 = 0$ is given by

\[ P(|T| > 20.55) = 0.000, \]

when $T \sim t(27)$, and so we have strong evidence against $H_0$. It seems clear that there
is a strong positive relationship between Y and X. Since the 0.975 point of the $t(27)$
distribution equals 2.0518, a 0.95-confidence interval for $\beta_2$ is given by

\[ 1.278 \pm (0.062)\,2.0518 = [1.1508, 1.4052]. \]


The ANOVA decomposition is given in the following table.

Source       Df   Sum of Squares   Mean Square
Regression   1    3797.0           3797.0
Error        27   242.8            9.0
Total        28   4039.8

Accordingly, we have that $F = 3797.0/9.0 = 421.888$ and, as $F \sim F(1, 27)$ when
$H_0 : \beta_2 = 0$ is true, $P(F > 421.888) = 0.000$, which simply confirms (as it must)
what we got from the preceding t-test.

The coefficient of determination is given by $R^2 = 3797.0/4039.8 = 0.94$. Therefore,
94% of the observed variation in the response variable can be explained by the
changes in the predictor through the simple linear regression. The value of $R^2$ indicates
that the fitted model will be an excellent predictor of future values, provided that the
value of X that we want to predict at is in the range (or close to it) of the values of X
used to fit the model.


10.3.3 | Bayesian Simple Linear Model (Advanced) 


For the Bayesian formulation of the simple linear regression model with normal error,
we need to add a prior distribution for the unknown parameters of the model, namely,
$\beta_1$, $\beta_2$, and $\sigma^2$. There are many possible choices for this. A relevant prior is dependent
on the application.

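To help simplify the calculations, we reparameterize the model as follows. Let
$\alpha_1 = \beta_1 + \beta_2 \bar{x}$ and $\alpha_2 = \beta_2$. It is then easy to show (see Problem 10.3.24) that

\[
\begin{aligned}
\sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)^2 &= \sum_{i=1}^n (y_i - \alpha_1 - \alpha_2 (x_i - \bar{x}))^2 \\
&= \sum_{i=1}^n \left( (y_i - \bar{y}) - (\alpha_1 - \bar{y}) - \alpha_2 (x_i - \bar{x}) \right)^2 \\
&= \sum_{i=1}^n (y_i - \bar{y})^2 + n(\alpha_1 - \bar{y})^2 + \alpha_2^2 \sum_{i=1}^n (x_i - \bar{x})^2 - 2\alpha_2 \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).
\end{aligned}
\tag{10.3.12}
\]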


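The likelihood function, using this reparameterization, then equals

\[ (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \alpha_1 - \alpha_2 (x_i - \bar{x}))^2 \right). \]

From (10.3.12), and setting

\[ c_x^2 = \sum_{i=1}^n (x_i - \bar{x})^2, \qquad c_y^2 = \sum_{i=1}^n (y_i - \bar{y})^2, \qquad c_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}), \]

we can write this as

\[
\begin{aligned}
& (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{c_y^2}{2\sigma^2} \right) \exp\left( -\frac{n}{2\sigma^2} (\alpha_1 - \bar{y})^2 \right) \exp\left( -\frac{1}{2\sigma^2} \left\{ \alpha_2^2 c_x^2 - 2\alpha_2 c_{xy} \right\} \right) \\
&\quad = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{c_y^2 - c_x^2 a^2}{2\sigma^2} \right) \exp\left( -\frac{n}{2\sigma^2} (\alpha_1 - \bar{y})^2 \right) \exp\left( -\frac{c_x^2}{2\sigma^2} (\alpha_2 - a)^2 \right),
\end{aligned}
\]

where the last equality follows from $\alpha_2^2 c_x^2 - 2\alpha_2 c_{xy} = c_x^2 (\alpha_2 - a)^2 - c_x^2 a^2$ with
$a = c_{xy}/c_x^2$.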

This implies that, whenever the prior distribution on $(\alpha_1, \alpha_2)$ is such that $\alpha_1$ and
$\alpha_2$ are independent given $\sigma^2$, then the posterior distributions of $\alpha_1$ and $\alpha_2$ are also
independent given $\sigma^2$. Note also that $\bar{y}$ and $a$ are the least-squares estimates (as well
as the MLE's) of $\alpha_1$ and $\alpha_2$, respectively (see Problem 10.3.24).


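Now suppose we take the prior to be

\[
\begin{aligned}
\alpha_1 \mid \alpha_2, \sigma^2 &\sim N(\mu_1, \tau_1^2 \sigma^2), \\
\alpha_2 \mid \sigma^2 &\sim N(\mu_2, \tau_2^2 \sigma^2), \\
1/\sigma^2 &\sim \mathrm{Gamma}(\kappa, \nu).
\end{aligned}
\]

Note that $\alpha_1$ and $\alpha_2$ are independent given $\sigma^2$.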
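As it turns out, this prior is conjugate, so we can easily determine an exact form for
the posterior distribution (see Problem 10.3.25). The joint posterior of $(\alpha_1, \alpha_2, 1/\sigma^2)$
is given by

\[
\begin{aligned}
\alpha_1 \mid \alpha_2, \sigma^2 &\sim N\left( \left( n + \frac{1}{\tau_1^2} \right)^{-1} \left( n\bar{y} + \frac{\mu_1}{\tau_1^2} \right), \; \left( n + \frac{1}{\tau_1^2} \right)^{-1} \sigma^2 \right), \\
\alpha_2 \mid \sigma^2 &\sim N\left( \left( c_x^2 + \frac{1}{\tau_2^2} \right)^{-1} \left( c_x^2 a + \frac{\mu_2}{\tau_2^2} \right), \; \left( c_x^2 + \frac{1}{\tau_2^2} \right)^{-1} \sigma^2 \right), \\
1/\sigma^2 &\sim \mathrm{Gamma}\left( \kappa + \frac{n}{2}, \; \nu_{xy} \right),
\end{aligned}
\]

where, completing the squares in $\alpha_1$ and $\alpha_2$,

\[
\begin{aligned}
\nu_{xy} = {} & \frac{1}{2} \left\{ c_x^2 a^2 + \frac{\mu_2^2}{\tau_2^2} - \left( c_x^2 + \frac{1}{\tau_2^2} \right)^{-1} \left( c_x^2 a + \frac{\mu_2}{\tau_2^2} \right)^2 \right\} \\
& + \frac{1}{2} \left\{ n\bar{y}^2 + \frac{\mu_1^2}{\tau_1^2} - \left( n + \frac{1}{\tau_1^2} \right)^{-1} \left( n\bar{y} + \frac{\mu_1}{\tau_1^2} \right)^2 \right\} + \frac{1}{2} \left\{ c_y^2 - c_x^2 a^2 \right\} + \nu.
\end{aligned}
\]

Of course, we must select the values of the hyperparameters $\mu_1$, $\tau_1$, $\mu_2$, $\tau_2$, $\kappa$, and $\nu$
to fully specify the prior.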

Now observe that for a diffuse analysis, i.e., when we have little or no prior
information about the parameters, we let $\tau_1 \to \infty$, $\tau_2 \to \infty$, and $\nu \to 0$, and the posterior
converges to

\[
\begin{aligned}
\alpha_1 \mid \alpha_2, \sigma^2 &\sim N(\bar{y}, \sigma^2/n), \\
\alpha_2 \mid \sigma^2 &\sim N(a, \sigma^2/c_x^2), \\
1/\sigma^2 &\sim \mathrm{Gamma}(\kappa + n/2, \nu_{xy}),
\end{aligned}
\]

where $\nu_{xy} = (1/2)\{c_y^2 - c_x^2 a^2\}$. But this still leaves us with the necessity of choosing
the hyperparameter $\kappa$. We will see, however, that this choice has only a small effect on
the analysis when $n$ is not too small.

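We can easily work out the marginal posterior distribution of the $\alpha_i$. For example,
in the diffuse case, the marginal posterior density of $\alpha_2$ is proportional to

\[
\begin{aligned}
& \int_0^\infty \left( \frac{1}{\sigma^2} \right)^{1/2} \exp\left\{ -\frac{c_x^2}{2\sigma^2} (\alpha_2 - a)^2 \right\} \left( \frac{1}{\sigma^2} \right)^{\kappa + (n/2) - 1} \exp\left\{ -\frac{\nu_{xy}}{\sigma^2} \right\} d\left( \frac{1}{\sigma^2} \right) \\
&\quad = \int_0^\infty \left( \frac{1}{\sigma^2} \right)^{\kappa + (n/2) - (1/2)} \exp\left\{ -\left( \nu_{xy} + \frac{c_x^2}{2} (\alpha_2 - a)^2 \right) \frac{1}{\sigma^2} \right\} d\left( \frac{1}{\sigma^2} \right).
\end{aligned}
\]

Making the change of variable $1/\sigma^2 \to w$, where

\[ w = \left( \nu_{xy} + \frac{c_x^2}{2} (\alpha_2 - a)^2 \right) \frac{1}{\sigma^2}, \]

in the preceding integral, shows that the marginal posterior density of $\alpha_2$ is proportional
to

\[ \left( 1 + \frac{c_x^2}{2\nu_{xy}} (\alpha_2 - a)^2 \right)^{-(\kappa + (n+1)/2)} \int_0^\infty w^{\kappa + (n/2) - (1/2)} \exp\{-w\} \, dw, \]

which is proportional to

\[ \left( 1 + \frac{c_x^2}{2\nu_{xy}} (\alpha_2 - a)^2 \right)^{-(2\kappa + n + 1)/2}. \]

This establishes (see Problem 4.6.17) that the posterior distribution of $\alpha_2$ is specified
by

\[ \sqrt{2\kappa + n} \; \frac{\alpha_2 - a}{\sqrt{2\nu_{xy}/c_x^2}} \sim t(2\kappa + n). \]

So a $\gamma$-HPD (highest posterior density) interval for $\alpha_2$ is given by

\[ a \pm \frac{1}{\sqrt{2\kappa + n}} \sqrt{\frac{2\nu_{xy}}{c_x^2}} \; t_{(1+\gamma)/2}(2\kappa + n). \]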


Note that these intervals will not change much as we change $\kappa$, provided that $n$ is not
too small.
We consider an application of a Bayesian analysis for such a model. 


EXAMPLE 10.3.9 Haavelmo’s Data on Income and Investment 

The data for this example were taken from An Introduction to Bayesian Inference in 
Econometrics, by A. Zellner (Wiley Classics, New York, 1996). The response variable 
Y is income in U.S. dollars per capita (deflated), and the predictor variable X is invest- 
ment in dollars per capita (deflated) for the United States for the years 1922-1941. The 
data are provided in the following table. 


Year Income Investment | Year Income Investment 


In Figure 10.3.7, we present a normal probability plot of the standardized residuals, 
obtained via a least-squares fit. In Figure 10.3.8, we present a plot of the standardized 
residuals against the predictor. Both plots indicate that the model assumptions are 
reasonable. 

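Suppose now that we analyze these data using the limiting diffuse prior with $\kappa = 2$.
Here, we have that $\bar{y} = 483$, $c_y^2 = 64993$, $c_x^2 = 5710.55$, and $c_{xy} = 17408.3$, so that
$a = 17408.3/5710.55 = 3.05$ and $\nu_{xy} = (64993 - 17408.3)/2 = 23792.35$. The
posterior is then given by

\[
\begin{aligned}
\alpha_1 \mid \alpha_2, \sigma^2 &\sim N(483, \sigma^2/20), \\
\alpha_2 \mid \sigma^2 &\sim N(3.05, \sigma^2/5710.55), \\
1/\sigma^2 &\sim \mathrm{Gamma}(12, 23792.35).
\end{aligned}
\]

The primary interest here is in the investment multiplier $\alpha_2$. By the above results, a
0.95-HPD interval for $\alpha_2$, using $t_{0.975}(24) = 2.0639$, is given by

\[
\begin{aligned}
a \pm \frac{1}{\sqrt{2\kappa + n}} \sqrt{\frac{2\nu_{xy}}{c_x^2}} \; t_{(1+\gamma)/2}(2\kappa + n)
&= 3.05 \pm \frac{1}{\sqrt{24}} \sqrt{\frac{2 \cdot 23792.35}{5710.55}} \; t_{0.975}(24) \\
&= 3.05 \pm (0.589)\,2.0639 = (1.834, 4.266).
\end{aligned}
\]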

Figure 10.3.7: Normal probability plot of the standardized residuals in Example 10.3.9.

Figure 10.3.8: Plot of the standardized residuals against the predictor in Example 10.3.9.


10.3.4 | The Multiple Linear Regression Model (Advanced) 


We now consider the situation in which we have a quantitative response Y and
quantitative predictors $X_1, \ldots, X_k$. For the regression model, we assume that the conditional
distributions of Y, given the predictors, have constant shape and that they change, as the
predictors change, at most through the conditional mean $E(Y \mid X_1 = x_1, \ldots, X_k = x_k)$.
For the linear regression model, we assume that this conditional mean is of the form

\[ E(Y \mid X_1 = x_1, \ldots, X_k = x_k) = \beta_1 x_1 + \cdots + \beta_k x_k. \tag{10.3.13} \]

This is linear in the unknown $\beta_i \in R^1$ for $i = 1, \ldots, k$.




We will develop only the broad outline of the analysis of the multiple linear regres- 
sion model here. All results will be stated without proofs provided. The proofs can be 
found in more advanced texts. It is important to note, however, that all of these results 
are just analogs of the results we developed by elementary methods in Section 10.3.2, 
for the simple linear regression model. 


Matrix Formulation of the Least-Squares Problem 


For the analysis of the multiple linear regression model, we need some matrix concepts. 
We will briefly discuss some of these here, but also see Appendix A.4. 

Let $A \in R^{m \times n}$ denote a rectangular array of numbers with $m$ rows and $n$ columns,
and let $a_{ij}$ denote the entry in the ith row and jth column (referred to as the $(i, j)$-th
entry of A). For example,


SFI, 10 00 2x3 
4=(35 0.2 ga JER 


denotes a $2 \times 3$ matrix and, for example, $a_{22} = 0.2$.

We can add two matrices of the same dimensions $m$ and $n$ by simply adding their
elements componentwise. So if $A, B \in R^{m \times n}$ and $C = A + B$, then $c_{ij} = a_{ij} + b_{ij}$.
Furthermore, we can multiply a matrix by a real number $c$ by simply multiplying
every entry in the matrix by $c$. So if $A \in R^{m \times n}$, then $B = cA \in R^{m \times n}$ and
$b_{ij} = c a_{ij}$. We will sometimes write a matrix $A \in R^{m \times n}$ in terms of its columns as
$A = (a_1 \; \ldots \; a_n)$ so that here $a_i \in R^m$. Finally, if $A \in R^{m \times n}$ and $b \in R^n$, then we
define the product of A times b as $Ab = b_1 a_1 + \cdots + b_n a_n \in R^m$.
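The definition of $Ab$ as a linear combination of the columns of A can be written out directly. The sketch below is illustrative only: matrices are stored as lists of rows, and the matrix `A` and vector `b` are made-up values, not taken from the text.

```python
def column(A, j):
    """Return the jth column of a matrix stored as a list of rows."""
    return [row[j] for row in A]

def mat_vec(A, b):
    """Compute Ab = b_1 a_1 + ... + b_n a_n, where a_j is the jth column of A."""
    m, n = len(A), len(A[0])
    if len(b) != n:
        raise ValueError("b must have one entry per column of A")
    result = [0.0] * m
    for j in range(n):
        a_j = column(A, j)
        # Add b_j times the jth column to the running total.
        result = [r + b[j] * a for r, a in zip(result, a_j)]
    return result

# Hypothetical 3 x 2 matrix and vector in R^2, chosen only for illustration.
A = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
b = [10.0, 1.0]
print(mat_vec(A, b))
```

Here the result is 10 times the first column plus 1 times the second column, a vector in $R^3$.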

Suppose now that $Y \in R^n$ and that $E(Y)$ is constrained to lie in a set of the form

\[ S = \{ \beta_1 v_1 + \cdots + \beta_k v_k : \beta_i \in R^1, \; i = 1, \ldots, k \}, \]

where $v_1, \ldots, v_k$ are fixed vectors in $R^n$. A set such as S is called a linear subspace of
$R^n$. When $\{v_1, \ldots, v_k\}$ has the linear independence property, namely,

\[ \beta_1 v_1 + \cdots + \beta_k v_k = 0 \]

if and only if $\beta_1 = \cdots = \beta_k = 0$, then we say that S has dimension $k$ and $\{v_1, \ldots, v_k\}$
is a basis for S.

If we set

\[ V = (v_1 \; \cdots \; v_k) = \begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1k} \\ v_{21} & v_{22} & \cdots & v_{2k} \\ \vdots & \vdots & & \vdots \\ v_{n1} & v_{n2} & \cdots & v_{nk} \end{pmatrix} \in R^{n \times k}, \]

then we can write

\[ E(Y) = \beta_1 v_1 + \cdots + \beta_k v_k = \begin{pmatrix} \beta_1 v_{11} + \beta_2 v_{12} + \cdots + \beta_k v_{1k} \\ \beta_1 v_{21} + \beta_2 v_{22} + \cdots + \beta_k v_{2k} \\ \vdots \\ \beta_1 v_{n1} + \beta_2 v_{n2} + \cdots + \beta_k v_{nk} \end{pmatrix} = V\beta \]

for some unknown point $\beta = (\beta_1, \beta_2, \ldots, \beta_k)'$. When we observe $y \in R^n$, then the
least-squares estimate of $E(Y)$ is obtained by finding the value of $\beta$ that minimizes

\[ \sum_{i=1}^n (y_i - \beta_1 v_{i1} - \beta_2 v_{i2} - \cdots - \beta_k v_{ik})^2. \]

It can be proved that a unique minimizing value for $\beta \in R^k$ exists whenever
$\{v_1, \ldots, v_k\}$ is a basis. The minimizing value of $\beta$ will be denoted by $b$ and is called
the least-squares estimate of $\beta$. The point $b_1 v_1 + \cdots + b_k v_k = Vb$ is the least-squares
estimate of $E(Y)$ and is sometimes called the vector of fitted values. The point $y - Vb$
is called the vector of residuals.

We now consider how to calculate $b$. For this, we need to understand what it means
to multiply the matrix $A \in R^{m \times k}$ on the right by the matrix $B \in R^{k \times n}$. The matrix
product $AB$ is defined to be the $m \times n$ matrix whose $(i, j)$-th entry is given by

\[ \sum_{l=1}^k a_{il} b_{lj}. \]

Notice that the array A must have the same number of columns as the number of rows
of B for this product to be defined. The transpose of a matrix $A \in R^{m \times k}$ is defined to
be

\[ A' = \begin{pmatrix} a_{11} & \cdots & a_{m1} \\ \vdots & & \vdots \\ a_{1k} & \cdots & a_{mk} \end{pmatrix} \in R^{k \times m}, \]

namely, the ith column of A becomes the ith row of $A'$. For a matrix $A \in R^{k \times k}$, the
matrix inverse of A is defined to be the matrix $A^{-1}$ such that

\[ A A^{-1} = A^{-1} A = I, \]

where $I \in R^{k \times k}$ has 1's along its diagonal and 0's everywhere else; it is called the $k \times k$
identity matrix. It is not always the case that $A \in R^{k \times k}$ has an inverse, but when it does
it can be shown that the inverse is unique. Note that there are many mathematical and
statistical software packages that include the facility for computing matrix products,
transposes, and inverses.

We have the following fundamental result. 


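Theorem 10.3.7 If $E(Y) \in S = \{ \beta_1 v_1 + \cdots + \beta_k v_k : \beta_i \in R^1, \; i = 1, \ldots, k \}$
and the columns of $V = (v_1 \; \cdots \; v_k)$ have the linear independence property, then
$(V'V)^{-1}$ exists, the least-squares estimate of $\beta$ is unique, and it is given by

\[ b = \begin{pmatrix} b_1 \\ \vdots \\ b_k \end{pmatrix} = (V'V)^{-1} V' y. \tag{10.3.14} \]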
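Formula (10.3.14) can be tried out on the simple linear regression of Example 10.3.3 by treating the constant 1 as a first predictor, so that V has the columns $(1, \ldots, 1)'$ and $(x_1, \ldots, x_n)'$. This is a sketch under those assumptions: the function `normal_equations_2col` is hypothetical and handles only two columns, where the inverse of the $2 \times 2$ matrix $V'V$ can be written explicitly; the last data point is taken as (-2.1, -1.6).

```python
# Data of Example 10.3.3 (last point assumed to be (-2.1, -1.6)).
xs = [3.9, 2.6, 2.4, 4.1, -0.2, 5.4, 0.6, -5.6, -1.1, -2.1]
ys = [8.9, 7.1, 4.6, 10.7, 1.0, 12.6, 3.3, -10.4, -2.3, -1.6]

def normal_equations_2col(V, y):
    """Return b = (V'V)^{-1} V'y for a matrix V (list of rows) with two columns."""
    # Entries of the symmetric 2 x 2 matrix V'V.
    a11 = sum(row[0] * row[0] for row in V)
    a12 = sum(row[0] * row[1] for row in V)
    a22 = sum(row[1] * row[1] for row in V)
    # Entries of the vector V'y.
    g1 = sum(row[0] * yi for row, yi in zip(V, y))
    g2 = sum(row[1] * yi for row, yi in zip(V, y))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("columns of V are linearly dependent; V'V has no inverse")
    # Explicit inverse of a 2 x 2 matrix applied to V'y.
    return (a22 * g1 - a12 * g2) / det, (a11 * g2 - a12 * g1) / det

V = [[1.0, x] for x in xs]   # first column of ones, second column the predictor
b1, b2 = normal_equations_2col(V, ys)
print(round(b1, 2), round(b2, 2))
```

The result agrees with the estimates 1.33 and 2.06 obtained from Theorem 10.3.1. For more than two predictors one would normally rely on a linear algebra routine from a software package instead of an explicit inverse.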




Least-Squares Estimates, Predictions, and Standard Errors 
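For the linear regression model (10.3.13), we have that (writing $X_{ij}$ for the jth value
of $X_i$)

\[
E\left( \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} \,\Big|\, X_{ij} = x_{ij} \text{ for all } i, j \right)
= \begin{pmatrix} \beta_1 x_{11} + \cdots + \beta_k x_{1k} \\ \vdots \\ \beta_1 x_{n1} + \cdots + \beta_k x_{nk} \end{pmatrix}
= \beta_1 v_1 + \cdots + \beta_k v_k = V\beta,
\]

where $\beta = (\beta_1, \ldots, \beta_k)'$ and

\[ V = (v_1 \; v_2 \; \ldots \; v_k) = \begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{nk} \end{pmatrix} \in R^{n \times k}. \]

We will assume, hereafter, that the columns $v_1, \ldots, v_k$ of V have the linear independence
property. Then (replacing expectation by conditional expectation) it is immediate
that the least-squares estimate of $\beta$ is given by (10.3.14).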

As with the simple linear regression model, we have a number of results concerning 
the least-squares estimates. We state these here without proof. 


Theorem 10.3.8 If the $(x_{i1}, \ldots, x_{ik}, y_i)$ are independent observations for $i = 1, \ldots, n$,
and the linear regression model applies, then

$$E(B_i \mid X_{ij} = x_{ij} \text{ for all } i, j) = \beta_i.$$


So Theorem 10.3.8 states that the least-squares estimates are unbiased estimates of the 
linear regression coefficients. 
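Unbiasedness can be illustrated, though of course not proved, by simulation. The sketch below holds a hypothetical design matrix fixed, repeatedly generates responses from a linear regression model with coefficients we chose ourselves, and averages the resulting least-squares estimates. The normal errors, the sample size, and the number of replications are all assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

beta = np.array([1.0, -2.0, 0.5])     # hypothetical true coefficients
n = 30
# Fixed design: an intercept column and two hypothetical predictors.
V = np.column_stack([np.ones(n), rng.uniform(0, 1, n), rng.uniform(0, 1, n)])

estimates = []
for _ in range(5000):
    # New responses each time; the predictor values stay fixed (we condition on them).
    y = V @ beta + rng.normal(0.0, 1.0, n)
    estimates.append(np.linalg.solve(V.T @ V, V.T @ y))

mean_estimate = np.mean(estimates, axis=0)   # should be close to beta
```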

If we want to assess the accuracy of these estimates, then we need to be able to 
compute their standard errors. 


Theorem 10.3.9 If the $(x_{i1}, \ldots, x_{ik}, y_i)$ are independent observations for $i = 1, \ldots, n$,
from the linear regression model, and if $\mathrm{Var}(Y \mid X_1 = x_1, \ldots, X_k = x_k) = \sigma^2$
for every $x_1, \ldots, x_k$, then

$$\mathrm{Cov}(B_i, B_j \mid X_{ij} = x_{ij} \text{ for all } i, j) = \sigma^2 c_{ij}, \qquad (10.3.15)$$

where $c_{ij}$ is the $(i, j)$-th entry in the matrix $(V'V)^{-1}$.


We have the following result concerning the estimation of the mean

$$E(Y \mid X_1 = x_1, \ldots, X_k = x_k) = \beta_1 x_1 + \cdots + \beta_k x_k$$

by the estimate $b_1 x_1 + \cdots + b_k x_k$.




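Corollary 10.3.4

$$\mathrm{Var}(B_1 x_1 + \cdots + B_k x_k \mid X_{ij} = x_{ij} \text{ for all } i, j) = \sigma^2 \left( \sum_{i=1}^{k} x_i^2 c_{ii} + 2 \sum_{i<j} x_i x_j c_{ij} \right) = \sigma^2 x'(V'V)^{-1} x, \qquad (10.3.16)$$

where $x = (x_1, \ldots, x_k)'$.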


We also use $b_1 x_1 + \cdots + b_k x_k = b'x$ as a prediction of a new response value when
$X_1 = x_1, \ldots, X_k = x_k$.

We see, from Theorem 10.3.9 and Corollary 10.3.4, that we need an estimate of $\sigma^2$
to compute standard errors. The estimate is given by

$$s^2 = \frac{1}{n - k} \sum_{i=1}^{n} (y_i - b_1 x_{i1} - \cdots - b_k x_{ik})^2 = \frac{1}{n - k} (y - Vb)'(y - Vb), \qquad (10.3.17)$$


and we have the following result. 


Theorem 10.3.10 If the $(x_{i1}, \ldots, x_{ik}, y_i)$ are independent observations for $i = 1, \ldots, n$,
from the linear regression model, and if $\mathrm{Var}(Y \mid X_1 = x_1, \ldots, X_k = x_k) = \sigma^2$,
then

$$E(S^2 \mid X_{ij} = x_{ij} \text{ for all } i, j) = \sigma^2.$$


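Combining (10.3.15) and (10.3.17), we deduce that the standard error of $b_i$ is $s \sqrt{c_{ii}}$.
Combining (10.3.16) and (10.3.17), we deduce that the standard error of $b_1 x_1 + \cdots + b_k x_k$ is

$$s \left( \sum_{i=1}^{k} x_i^2 c_{ii} + 2 \sum_{i<j} x_i x_j c_{ij} \right)^{1/2} = s \left( x'(V'V)^{-1} x \right)^{1/2}.$$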
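The quantities above can be computed together in a few lines. The following is a sketch only: the function name `regression_standard_errors` and the small data set are hypothetical, and the matrix $(V'V)^{-1}$ is formed explicitly because its entries $c_{ij}$ are needed.

```python
import numpy as np

def regression_standard_errors(V, y, x_new):
    """Return b, s^2, the standard errors of the b_i, and the standard error of b'x_new."""
    V = np.asarray(V, dtype=float)
    y = np.asarray(y, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    n, k = V.shape
    C = np.linalg.inv(V.T @ V)                  # entries c_ij of (V'V)^{-1}
    b = C @ V.T @ y                             # least-squares estimate
    resid = y - V @ b
    s2 = (resid @ resid) / (n - k)              # estimate of sigma^2
    se_b = np.sqrt(s2 * np.diag(C))             # s * sqrt(c_ii)
    se_mean = np.sqrt(s2 * (x_new @ C @ x_new)) # s * (x'(V'V)^{-1} x)^{1/2}
    return b, s2, se_b, se_mean

# Hypothetical data with an intercept and one predictor.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
V = np.column_stack([np.ones_like(x), x])
x_new = np.array([1.0, 3.5])                    # intercept term and predictor value 3.5

b, s2, se_b, se_mean = regression_standard_errors(V, y, x_new)
```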


The ANOVA Decomposition and F-Statistics 


When one of the predictors $X_1, \ldots, X_k$ is constant, then we say that the model has an
intercept term. By convention, we will always take this to be the first predictor. So
when we want the model to have an intercept term, we take $X_1 \equiv 1$ and $\beta_1$ is the
intercept, e.g., the simple linear regression model. Note that it is common to denote the
intercept term by $\beta_0$ so that $X_0 \equiv 1$ and $X_1, \ldots, X_k$ denote the predictors that actually
change. We will also adopt this convention when it seems appropriate.

Basically, inclusion of an intercept term is very common, as this says that, when 
the predictors that actually change have no relationship with the response Y, then the 
intercept is the unknown mean of the response. When we do not include an intercept, 
then this says we know that the mean response is 0 when there is no relationship be- 
tween Y and the nonconstant predictors. Unless there is substantive, application-based 
evidence to support this, we will generally not want to make this assumption. 

Denoting the intercept term by $\beta_1$, so that $X_1 \equiv 1$, we have the following ANOVA
decomposition for this model that shows how to isolate the observed variation in $Y$ that
can be explained by changes in the nonconstant predictors.




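Lemma 10.3.2 If, for $i = 1, \ldots, n$, the values $(x_{i1}, \ldots, x_{ik}, y_i)$ are such that the
matrix $V$ has linearly independent columns, with $v_1$ equal to a column of ones, then
$b_1 = \bar{y} - b_2 \bar{x}_2 - \cdots - b_k \bar{x}_k$ and

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} \left( b_2 (x_{i2} - \bar{x}_2) + \cdots + b_k (x_{ik} - \bar{x}_k) \right)^2 + \sum_{i=1}^{n} (y_i - b_1 x_{i1} - \cdots - b_k x_{ik})^2.$$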


We call

$$RSS(X_2, \ldots, X_k) = \sum_{i=1}^{n} \left( b_2 (x_{i2} - \bar{x}_2) + \cdots + b_k (x_{ik} - \bar{x}_k) \right)^2$$

the regression sum of squares and

$$ESS = \sum_{i=1}^{n} (y_i - b_1 x_{i1} - \cdots - b_k x_{ik})^2$$

the error sum of squares. This leads to the following ANOVA table.


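Source               Df       Sum of Squares                      Mean Square
$X_2, \ldots, X_k$   $k - 1$  $RSS(X_2, \ldots, X_k)$             $RSS(X_2, \ldots, X_k)/(k - 1)$
Error                $n - k$  ESS                                 $s^2$
Total                $n - 1$  $\sum_{i=1}^{n} (y_i - \bar{y})^2$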


When there is an intercept term, the null hypothesis of no relationship between the
response and the predictors is equivalent to $H_0 : \beta_2 = \cdots = \beta_k = 0$. As with the
simple linear regression model, the mean square for regression can be shown to be an
unbiased estimator of $\sigma^2$ if and only if the null hypothesis is true. Therefore, a sensible
statistic to use for assessing the null hypothesis is the F-statistic

$$F = \frac{RSS(X_2, \ldots, X_k)/(k - 1)}{s^2},$$

with large values being evidence against the null.
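The ANOVA decomposition and the F-statistic can be checked numerically. In the sketch below, the function name `anova_decomposition` and the simulated data are our own assumptions; the regression sum of squares is computed from the centered predictors, as in its definition, and the three sums of squares are returned so that the decomposition can be verified.

```python
import numpy as np

def anova_decomposition(X, y):
    """X holds the nonconstant predictors (n rows); a column of ones is added as v1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    V = np.column_stack([np.ones(n), X])
    k = V.shape[1]
    b = np.linalg.solve(V.T @ V, V.T @ y)
    X_centered = X - X.mean(axis=0)
    rss = np.sum((X_centered @ b[1:]) ** 2)     # regression sum of squares
    ess = np.sum((y - V @ b) ** 2)              # error sum of squares
    total = np.sum((y - y.mean()) ** 2)         # total sum of squares
    s2 = ess / (n - k)
    f_stat = (rss / (k - 1)) / s2
    return {"rss": rss, "ess": ess, "total": total, "s2": s2, "F": f_stat}

# Hypothetical data: two nonconstant predictors and normal errors.
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=20)

result = anova_decomposition(X, y)
```

Up to rounding error, `result["total"]` equals `result["rss"] + result["ess"]`, which is the decomposition of Lemma 10.3.2.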
Often, we want to assess the null hypothesis $H_0 : \beta_{l+1} = \cdots = \beta_k = 0$ or,
equivalently, the hypothesis that the model is given by

$$E(Y \mid X_1 = x_1, \ldots, X_k = x_k) = \beta_1 x_1 + \cdots + \beta_l x_l,$$

where $l < k$. This hypothesis says that the last $k - l$ predictors $X_{l+1}, \ldots, X_k$ have no
relationship with the response.

If we denote the least-squares estimates of $\beta_1, \ldots, \beta_l$, obtained by fitting the smaller
model, by $b_1^*, \ldots, b_l^*$, then we have the following result.




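Lemma 10.3.3 If the $(x_{i1}, \ldots, x_{ik}, y_i)$ for $i = 1, \ldots, n$ are values for which the
matrix $V$ has linearly independent columns, with $v_1$ equal to a column of ones, then

$$RSS(X_2, \ldots, X_k) = \sum_{i=1}^{n} \left( b_2 (x_{i2} - \bar{x}_2) + \cdots + b_k (x_{ik} - \bar{x}_k) \right)^2 \geq \sum_{i=1}^{n} \left( b_2^* (x_{i2} - \bar{x}_2) + \cdots + b_l^* (x_{il} - \bar{x}_l) \right)^2 = RSS(X_2, \ldots, X_l). \qquad (10.3.18)$$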


On the right of the inequality in (10.3.18), we have the regression sum of squares
obtained by fitting the model based on the first $l$ predictors. Therefore, we can interpret
the difference of the left and right sides of (10.3.18), namely,

$$RSS(X_{l+1}, \ldots, X_k \mid X_2, \ldots, X_l) = RSS(X_2, \ldots, X_k) - RSS(X_2, \ldots, X_l)$$

as the contribution of the predictors $X_{l+1}, \ldots, X_k$ to the regression sum of squares
when the predictors $X_1, \ldots, X_l$ are in the model. We get the following ANOVA table
(actually only the first three columns of the ANOVA table) corresponding to this
decomposition of the total sum of squares.


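Source                                            Df       Sum of Squares
$X_2, \ldots, X_l$                                $l - 1$  $RSS(X_2, \ldots, X_l)$
$X_{l+1}, \ldots, X_k \mid X_2, \ldots, X_l$      $k - l$  $RSS(X_{l+1}, \ldots, X_k \mid X_2, \ldots, X_l)$
Error                                             $n - k$  ESS
Total                                             $n - 1$  $\sum_{i=1}^{n} (y_i - \bar{y})^2$
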
It can be shown that the null hypothesis $H_0 : \beta_{l+1} = \cdots = \beta_k = 0$ holds if and
only if

$$RSS(X_{l+1}, \ldots, X_k \mid X_2, \ldots, X_l)/(k - l)$$

is an unbiased estimator of $\sigma^2$. Therefore, a sensible statistic to use for assessing this
null hypothesis is the F-statistic

$$F = \frac{RSS(X_{l+1}, \ldots, X_k \mid X_2, \ldots, X_l)/(k - l)}{s^2},$$

with large values being evidence against the null.
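A sketch of this comparison of a full model with a smaller one follows. The helper names and the simulated data are hypothetical. Since both models contain the column of ones, each regression sum of squares is computed as the sum of squared deviations of the fitted values from $\bar{y}$, which agrees with the definition in terms of centered predictors.

```python
import numpy as np

def regression_ss(V, y):
    """Return (regression sum of squares, error sum of squares) for a design V whose
    first column is all ones."""
    b = np.linalg.solve(V.T @ V, V.T @ y)
    fitted = V @ b
    return np.sum((fitted - y.mean()) ** 2), np.sum((y - fitted) ** 2)

def partial_f(V, y, l):
    """F-statistic for H0: beta_{l+1} = ... = beta_k = 0, using the first l columns of V
    as the smaller model."""
    n, k = V.shape
    rss_full, ess_full = regression_ss(V, y)
    rss_small, _ = regression_ss(V[:, :l], y)
    s2 = ess_full / (n - k)                      # s^2 comes from the full model
    f_stat = ((rss_full - rss_small) / (k - l)) / s2
    return f_stat, rss_full, rss_small

# Hypothetical data in which the last predictor has no effect on the response.
rng = np.random.default_rng(4)
n = 40
X = rng.normal(size=(n, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(size=n)
V = np.column_stack([np.ones(n), X])             # k = 4 columns

f_value, rss_full, rss_small = partial_f(V, y, l=3)
```

In keeping with (10.3.18), `rss_full` is never smaller than `rss_small`, so the statistic is nonnegative.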


The Coefficient of Determination 
The coefficient of determination for this model is given by

$$R^2 = \frac{RSS(X_2, \ldots, X_k)}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

which, by Lemma 10.3.2, is always between 0 and 1. The value of $R^2$ gives the proportion
of the observed variation in $Y$ that is explained by the inclusion of the nonconstant
predictors in the model.




It can be shown that $R^2$ is the square of the multiple correlation coefficient between
$Y$ and $X_1, \ldots, X_k$. However, we do not discuss the multiple correlation coefficient in
this text.
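Although the multiple correlation coefficient is not developed here, one common way to compute it from data, which we take as an assumption for this sketch, is as the sample correlation between the observed responses and the fitted values from a model with an intercept. With hypothetical simulated data, the square of that correlation agrees numerically with $R^2$.

```python
import numpy as np

# Hypothetical data: an intercept, two nonconstant predictors, normal errors.
rng = np.random.default_rng(1)
n = 25
X = rng.normal(size=(n, 2))
y = 3.0 + X @ np.array([1.5, -0.5]) + rng.normal(size=n)

V = np.column_stack([np.ones(n), X])
b = np.linalg.solve(V.T @ V, V.T @ y)
fitted = V @ b

# R^2 as regression sum of squares over total sum of squares.
r_squared = np.sum((fitted - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

# Sample correlation between the responses and the fitted values.
corr = np.corrcoef(y, fitted)[0, 1]
```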


Confidence Intervals and Testing Hypotheses 


For inference, we have the following result. 


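Theorem 10.3.11 If the conditional distribution of $Y$ given $(X_1, \ldots, X_k) = (x_1, \ldots, x_k)$
is $N(\beta_1 x_1 + \cdots + \beta_k x_k, \sigma^2)$ and if we observe the independent values $(x_{i1}, \ldots,
x_{ik}, y_i)$ for $i = 1, \ldots, n$, then the conditional distributions of the $B_i$ and $S^2$, given
$X_{ij} = x_{ij}$ for all $i, j$, are as follows.

(i) $B_i \sim N(\beta_i, \sigma^2 c_{ii})$

(ii) $B_1 x_1 + \cdots + B_k x_k$ is distributed

$$N\left( \beta_1 x_1 + \cdots + \beta_k x_k, \; \sigma^2 \left( \sum_{i=1}^{k} x_i^2 c_{ii} + 2 \sum_{i<j} x_i x_j c_{ij} \right) \right)$$

(iii) $(n - k) S^2 / \sigma^2 \sim \chi^2(n - k)$ independent of $(B_1, \ldots, B_k)$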


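Corollary 10.3.5

(i) $(B_i - \beta_i) / (S c_{ii}^{1/2}) \sim t(n - k)$

(ii)
$$\frac{B_1 x_1 + \cdots + B_k x_k - \beta_1 x_1 - \cdots - \beta_k x_k}{S \left( \sum_{i=1}^{k} x_i^2 c_{ii} + 2 \sum_{i<j} x_i x_j c_{ij} \right)^{1/2}} \sim t(n - k)$$

(iii) $H_0 : \beta_{l+1} = \cdots = \beta_k = 0$ is true if and only if

$$F = \frac{\left( RSS(X_2, \ldots, X_k) - RSS(X_2, \ldots, X_l) \right)/(k - l)}{S^2} \sim F(k - l, n - k)$$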


Analysis of Residuals 


In an application of the multiple regression model, we must check to make sure that
the assumptions make sense. Model checking is based on the residuals $y_i - b_1 x_{i1} -
\cdots - b_k x_{ik}$ (after standardization), just as discussed in Section 9.1. Note that the $i$th
residual is simply the difference between the observed value $y_i$ at $(x_{i1}, \ldots, x_{ik})$ and
the predicted value $b_1 x_{i1} + \cdots + b_k x_{ik}$ at $(x_{i1}, \ldots, x_{ik})$.

We also have the following result (this can be proved as a corollary of Theorem
10.3.10).




Corollary 10.3.6

(i) $E(Y_i - B_1 x_{i1} - \cdots - B_k x_{ik} \mid V) = 0$

(ii) $\mathrm{Cov}(Y_i - B_1 x_{i1} - \cdots - B_k x_{ik}, \; Y_j - B_1 x_{j1} - \cdots - B_k x_{jk} \mid V) = \sigma^2 d_{ij}$, where
$d_{ij}$ is the $(i, j)$-th entry of the matrix $I - V(V'V)^{-1}V'$.

Therefore, the standardized residuals are given by

$$\frac{y_i - b_1 x_{i1} - \cdots - b_k x_{ik}}{s \, d_{ii}^{1/2}}. \qquad (10.3.19)$$


When $s$ is replaced by $\sigma$ in (10.3.19), Corollary 10.3.6 implies that this quantity has
conditional mean 0 and conditional variance 1. Furthermore, when the conditional
distribution of the response given the predictors is normal, then it can be shown that
the conditional distribution of this quantity is $N(0, 1)$. These results are also approximately
true for (10.3.19) for large $n$. Furthermore, it can be shown that the covariances
between the standardized residuals go to 0 as $n \to \infty$, under certain reasonable conditions
on the distribution of the predictor variables. So one approach to model checking
here is to see whether the values given by (10.3.19) look at all like a sample from the
$N(0, 1)$ distribution.
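The standardized residuals (10.3.19) can be computed directly from the matrix $I - V(V'V)^{-1}V'$. The following sketch uses a hypothetical function name and simulated data; in practice, the resulting values would then be examined in a normal probability plot or plotted against the predictors.

```python
import numpy as np

def standardized_residuals(V, y):
    """Return the standardized residuals (y_i - fitted_i) / (s * sqrt(d_ii))."""
    V = np.asarray(V, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = V.shape
    D = np.eye(n) - V @ np.linalg.inv(V.T @ V) @ V.T   # entries d_ij
    resid = D @ y                                      # same as y - Vb
    s = np.sqrt((resid @ resid) / (n - k))
    return resid / (s * np.sqrt(np.diag(D)))

# Hypothetical data: an intercept and two predictors.
rng = np.random.default_rng(5)
n = 15
V = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = V @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

std_res = standardized_residuals(V, y)
```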

What do we do if model checking leads to a failure of the model? As in Chapter 9,
we can consider making various transformations of the data to see if there is a simple
modification of the model that will pass. We can make transformations not only to the
response variable $Y$, but to the predictor variables $X_1, \ldots, X_k$ as well.


An Application of Multiple Linear Regression Analysis 


The computations needed to implement a multiple linear regression analysis cannot 
be carried out by hand. These are much too time-consuming and error-prone. It is 
therefore important that a statistician have a computer with suitable software available 
when doing a multiple linear regression analysis. 

The data in Table 10.1 are taken from Statistical Theory and Methodology in Sci- 
ence and Engineering, 2nd ed., by K. A. Brownlee (John Wiley & Sons, New York, 
1965). The response variable Y is stack loss (Loss), which represents 10 times the per- 
centage of ammonia lost as unabsorbed nitric oxide. The predictor variables are X1 = 
air flow (Air), X2 = temperature of inlet water (Temp), and X3 = the concentration of 
nitric acid (Acid). Also recorded is the day (Day) on which the observation was taken. 

We consider the model $Y \mid x_1, x_2, x_3 \sim N(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3, \sigma^2)$. Note that
we have included an intercept term. Figure 10.3.9 is a normal probability plot of the
standardized residuals. This looks reasonable, except for one residual, −2.63822, that
diverges quite distinctively from the rest of the values, which lie close to the 45-degree
line. Printing out the standardized residuals shows that this residual is associated with
the observation on the twenty-first day. Possibly there was something unique about this
day's operations, and so it is reasonable to discard this data value and refit the model.
Figure 10.3.10 is a normal probability plot obtained by fitting the model to the first
20 observations. This looks somewhat better, but still we might be concerned about at
least one of the residuals that deviates substantially from the 45-degree line.




Day Air Temp Acid Loss | Day Air Temp Acid Loss 




Table 10.1: Data for Application of Multiple Linear Regression Analysis 


[Figure: Normal Probability Plot of the Residuals (response is Loss); Normal Score plotted against Standardized Residual.]

Figure 10.3.9: Normal probability plot of the standardized residuals based on all the data.


[Figure: Normal Probability Plot of the Residuals (response is Loss); Normal Score plotted against Standardized Residual.]

Figure 10.3.10: Normal probability plot of the standardized residuals based on the first 20 data
values.




Following the analysis of these data in Fitting Equations to Data, by C. Daniel and
F. S. Wood (Wiley-Interscience, New York, 1971), we consider instead the model

$$\ln Y \mid x_1, x_2, x_3 \sim N(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3, \sigma^2), \qquad (10.3.20)$$

i.e., we transform the response variable by taking its logarithm and use all of the data.
Often, when models do not fit, simple transformations like this can lead to major improvements.
In this case, we see a much improved normal probability plot, as provided
in Figure 10.3.11.


[Figure: Normal Probability Plot of the Residuals (response is Loss); Normal Score plotted against Standardized Residual.]

Figure 10.3.11: Normal probability plot of the standardized residuals for all the data using ln Y
as the response.


We also looked at plots of the standardized residuals against the various predic- 
tors, and these looked reasonable. Figure 10.3.12 is a plot of the standardized residuals 
against the values of Air. 


[Figure: Residuals Versus Air (response is Loss); vertical axis: Standardized Residual.]

Figure 10.3.12: A plot of the standardized residuals for all the data, using ln Y as the response,
against the values of the predictor Air.




Now that we have accepted the model (10.3.20), we can proceed to inferences about
the unknowns of the model. The least-squares estimates of the $\beta_i$, their standard errors
(Se), the corresponding t-statistics for testing the $\beta_i = 0$, and the P-values for this are
given in the following table.

Coefficient   Estimate     Se         t-statistic   P-value
$\beta_0$     −0.948700    0.647700   −1.46         0.161
$\beta_1$      0.034565    0.007343    4.71         0.000
$\beta_2$      0.063460    0.020040    3.17         0.006
$\beta_3$      0.002864    0.008510    0.34         0.742

The estimate of $\sigma^2$ is given by $s^2 = 0.0312$.
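As a quick arithmetic check, each t-statistic in the table is the estimate divided by its standard error. The sketch below simply redoes that division with the reported values; the variable names are our own.

```python
# Estimates and standard errors exactly as reported in the table.
estimates = [-0.948700, 0.034565, 0.063460, 0.002864]
std_errors = [0.647700, 0.007343, 0.020040, 0.008510]

# Each t-statistic is estimate / standard error, rounded to two decimals as in the table.
t_stats = [round(est / se, 2) for est, se in zip(estimates, std_errors)]
```

The rounded ratios reproduce the tabled t-statistics −1.46, 4.71, 3.17, and 0.34.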

To test the null hypothesis that there is no relationship between the response and
the predictors, or that, equivalently, $H_0 : \beta_1 = \beta_2 = \beta_3 = 0$, we have the following
ANOVA table.

Source              Df   Sum of Squares   Mean Square
$X_1, X_2, X_3$      3   4.9515           1.6505
Error               17   0.5302           0.0312
Total               20   5.4817


The value of the F-statistic is given by 1.6505/0.0312 = 52.900, and when $F \sim
F(3, 17)$, we have that $P(F > 52.900) = 0.000$. So there is substantial evidence
against the null hypothesis. To see how well the model explains the variation in the
response, we computed the value of $R^2 = 4.9515/5.4817 = 90.3\%$. Therefore, approximately 90% of
the observed variation in $Y$ can be explained by changes in the predictors in the model.
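The entries of this ANOVA table are linked by simple arithmetic, which the following sketch reproduces from the reported sums of squares and degrees of freedom. Carrying the unrounded mean square for error through the division gives an F value slightly different from 52.900, which was obtained from the rounded value 0.0312.

```python
# Sums of squares and degrees of freedom as reported in the ANOVA table.
rss, ess, total_ss = 4.9515, 0.5302, 5.4817
df_regression, df_error = 3, 17

ms_regression = rss / df_regression   # mean square for regression
s2 = ess / df_error                   # mean square for error, the estimate of sigma^2
f_stat = ms_regression / s2           # F-statistic for H0: beta_1 = beta_2 = beta_3 = 0
```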

While we have concluded that a relationship exists between the response and the
predictors, it may be that some of the predictors have no relationship with the response.
For example, the table of t-statistics above would seem to indicate that perhaps $X_3$
(acid) is not affecting $Y$. We can assess this via the following ANOVA table, obtained
by fitting the model $\ln Y \mid x_1, x_2, x_3 \sim N(\beta_0 + \beta_1 x_1 + \beta_2 x_2, \sigma^2)$.


Df Sum of Squares Mean Square 


Note that $RSS(X_3 \mid X_1, X_2) = 4.9515 - 4.9480 = 0.0035$. The value of the F-statistic
for testing $H_0 : \beta_3 = 0$ is 0.0035/0.0312 = 0.112, and when $F \sim F(1, 17)$, we
have that $P(F > 0.112) = 0.742$. So we have no evidence against the null hypothesis
and can drop $X_3$ from the model. Actually, this is the same P-value as obtained via the
t-test of this null hypothesis, as, in general, the t-test that a single regression coefficient
is 0 is equivalent to the F-test. Similar tests of the need to include $X_1$ and $X_2$ do not
lead us to drop these variables from the model.
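The equivalence of the t-test and the F-test for a single coefficient can be seen numerically: the square of the t-statistic equals the F-statistic. The sketch below uses simulated data of our own (not the stack loss data), and uses the fact that, with the total sum of squares fixed, the drop in the regression sum of squares equals the increase in the error sum of squares.

```python
import numpy as np

# Hypothetical data: three predictors, the last having no effect on the response.
rng = np.random.default_rng(2)
n = 21
X = rng.normal(size=(n, 3))
y = 1.0 + 0.8 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(scale=0.5, size=n)
V = np.column_stack([np.ones(n), X])
k = V.shape[1]

def fit(V, y):
    """Return the least-squares estimate and the error sum of squares."""
    b = np.linalg.solve(V.T @ V, V.T @ y)
    return b, np.sum((y - V @ b) ** 2)

b_full, ess_full = fit(V, y)
_, ess_small = fit(V[:, :-1], y)          # model without the last predictor

s2 = ess_full / (n - k)
c = np.linalg.inv(V.T @ V)
t_stat = b_full[-1] / np.sqrt(s2 * c[-1, -1])   # t-statistic for the last coefficient
f_stat = (ess_small - ess_full) / s2            # F-statistic with 1 numerator df
```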

So based on the above results, we decide to drop $X_3$ from the model and use the
equation

$$E(\ln Y \mid X_1 = x_1, X_2 = x_2) = -0.7522 + 0.035402 x_1 + 0.06346 x_2 \qquad (10.3.21)$$

to describe the relationship between $Y$ and the predictors. Note that the least-squares
estimates of $\beta_0$, $\beta_1$, and $\beta_2$ in (10.3.21) are obtained by refitting the model without
$X_3$.
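The fitted equation (10.3.21) can be used to estimate the mean response at given predictor values. Because the model (10.3.20) was fit to $\ln Y$, we read the equation in this sketch as giving the estimated mean of the logarithm of stack loss. The function name and the settings (air flow 60, inlet water temperature 20) are hypothetical, chosen only to show the arithmetic.

```python
import math

def fitted_log_loss(air, temp):
    """Evaluate the right-hand side of (10.3.21) at air flow `air` and temperature `temp`."""
    return -0.7522 + 0.035402 * air + 0.06346 * temp

log_pred = fitted_log_loss(60.0, 20.0)   # estimated mean of ln Y at hypothetical settings
loss_scale = math.exp(log_pred)          # back-transformed to the scale of Y
```

Exponentiating an estimated mean of $\ln Y$ does not in general give an estimate of the mean of $Y$; under the normal model for $\ln Y$ it corresponds to the median of $Y$.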


Summary of Section 10.3 


• In this section, we examined the situation in which the response variable and the
predictor variables are quantitative.

• In this situation, the linear regression model provides a possible description of
the form of any relationship that may exist between the response and the predictors.

• Least squares is a standard method for fitting linear regression models to data.

• The ANOVA is a decomposition of the total variation observed in the response
variable into a part attributable to changes in the predictor variables and a part
attributable to random error.

• If we assume a normal linear regression model, then we have inference methods
available such as confidence intervals and tests of significance. In particular, we
have available the F-test to assess whether or not a relationship exists between
the response and the predictors.

• A normal linear regression model is checked by examining the standardized
residuals.


EXERCISES 


10.3.1 Suppose that $(x_1, \ldots, x_n)$ is a sample from a Bernoulli($\theta$) distribution, where
$\theta \in [0, 1]$ is unknown. What is the least-squares estimate of the mean of this distribution?

10.3.2 Suppose that $(x_1, \ldots, x_n)$ is a sample from the Uniform$[0, \theta]$, where $\theta > 0$ is
unknown. What is the least-squares estimate of the mean of this distribution?

10.3.3 Suppose that $(x_1, \ldots, x_n)$ is a sample from the Exponential($\theta$), where $\theta > 0$ is
unknown. What is the least-squares estimate of the mean of this distribution?

10.3.4 Consider the n = 11 data values in the following table. 


Observation  X  Y | Observation  X  Y


Suppose we consider the simple normal linear regression to describe the relationship 
between the response Y and the predictor X. 


(a) Plot the data in a scatter plot. 




(b) Calculate the least-squares line and plot this on the scatter plot in part (a). 

(c) Plot the standardized residuals against X. 

(d) Produce a normal probability plot of the standardized residuals. 

(e) What are your conclusions based on the plots produced in parts (c) and (d)? 

(f) If appropriate, calculate 0.95-confidence intervals for the intercept and slope. 

(g) Construct the ANOVA table to test whether or not there is a relationship between 
the response and the predictors. What is your conclusion? 

(h) If the model is correct, what proportion of the observed variation in the response is 
explained by changes in the predictor? 

(i) Predict a future Y at X = 0.0. Is this prediction an extrapolation or an interpolation? 
Determine the standard error of this prediction. 

(j) Predict a future Y at X = 6.0. Is this prediction an extrapolation or an interpolation? 
Determine the standard error of this prediction. 

(k) Predict a future Y at X = 20.0. Is this prediction an extrapolation or an interpola- 
tion? Determine the standard error of this prediction. Compare this with the standard 
errors obtained in parts (i) and (j) and explain the differences. 

10.3.5 Consider the n = 11 data values in the following table. 


Observation  X  Y | Observation  X  Y


Suppose we consider the simple normal linear regression to describe the relationship 
between the response Y and the predictor X. 


(a) Plot the data in a scatter plot. 

(b) Calculate the least-squares line and plot this on the scatter plot in part (a). 

(c) Plot the standardized residuals against X. 

(d) Produce a normal probability plot of the standardized residuals. 

(e) What are your conclusions based on the plots produced in parts (c) and (d)? 

(f) If appropriate, calculate 0.95-confidence intervals for the intercept and slope.

(g) Do the results of your analysis allow you to conclude that there is a relationship 
between Y and X? Explain why or why not. 

(h) If the model is correct, what proportion of the observed variation in the response is 
explained by changes in the predictor? 

10.3.6 Suppose the following data record the densities of an organism in a containment 
vessel for 10 days. Suppose we consider the simple normal linear regression to describe 
the relationship between the response Y (density) and the predictor X (day). 




Day Number/Liter | Day Number/Liter 


(a) Plot the data in a scatter plot. 

(b) Calculate the least-squares line and plot this on the scatter plot in part (a). 

(c) Plot the standardized residuals against X. 

(d) Produce a normal probability plot of the standardized residuals. 

(e) What are your conclusions based on the plots produced in parts (c) and (d)? 

(f) Can you think of a transformation of the response that might address any problems 
found? If so, repeat parts (a) through (e) after performing this transformation. (Hint: 
The scatter plot looks like exponential growth. What transformation is the inverse of 
exponentiation?) 

(g) Calculate 0.95-confidence intervals for the appropriate intercept and slope. 

(h) Construct the appropriate ANOVA table to test whether or not there is a relationship 
between the response and the predictors. What is your conclusion? 

(i) Do the results of your analysis allow you to conclude that there is a relationship 
between Y and X? Explain why or why not. 

(j) Compute the proportion of variation explained by the predictor for the two models 
you have considered. Compare the results. 

(k) Predict a future Y at X = 12. Is this prediction an extrapolation or an interpolation? 


10.3.7 A student takes weekly quizzes in a course and receives the following grades 


Week Grade | Week Grade 


over 12 weeks. 


(a) Plot the data in a scatter plot with X = week and Y = grade. 

(b) Calculate the least-squares line and plot this on the scatter plot in part (a). 
(c) Plot the standardized residuals against X. 

(d) What are your conclusions based on the plot produced in (c)? 

(e) Calculate 0.95-confidence intervals for the intercept and slope. 


(f) Construct the ANOVA table to test whether or not there is a relationship between 
the response and the predictors. What is your conclusion? 


(g) What proportion of the observed variation in the response is explained by changes 
in the predictor? 




10.3.8 Suppose that Y = E(Y | X) + Z, where X, Y and Z are random variables. 

(a) Show that E (Z | X) = 0. 

(b) Show that Cov(E(Y | X), Z) = 0. (Hint: Write Z = Y — E(Y | X) and use Theo- 
rems 3.5.2 and 3.5.4.) 


(c) Suppose that Z is independent of X. Show that this implies that the conditional 
distribution of Y given X depends on X only through its conditional mean. (Hint: 
Evaluate the conditional distribution function of Y given X = x.) 

10.3.9 Suppose that $X$ and $Y$ are random variables such that a regression model describes
the relationship between $Y$ and $X$. If $E(Y \mid X) = \exp\{\beta_1 + \beta_2 X\}$, then discuss
whether or not this is a simple linear regression model (perhaps involving a predictor
other than $X$).

10.3.10 Suppose that $X$ and $Y$ are random variables and $\mathrm{Corr}(X, Y) = 1$. Does a
simple linear regression model hold to describe the relationship between $Y$ and $X$? If
so, what is it?

10.3.11 Suppose that $X$ and $Y$ are random variables such that a regression model describes
the relationship between $Y$ and $X$. If $E(Y \mid X) = \beta_1 + \beta_2 X^2$, then discuss
whether or not this is a simple linear regression model (perhaps involving a predictor
other than $X$).

10.3.12 Suppose that $X \sim N(2, 3)$ independently of $Z \sim N(0, 1)$ and $Y = X + Z$.
Does this structure imply that the relationship between $Y$ and $X$ can be summarized by
a simple linear regression model? If so, what are $\beta_1$, $\beta_2$, and $\sigma^2$?

10.3.13 Suppose that a simple linear model is fit to data. An analysis of the residuals
indicates that there is no reason to doubt that the model is correct; the ANOVA test
indicates that there is substantial evidence against the null hypothesis of no relationship
between the response and predictor. The value of $R^2$ is found to be 0.05. What is the
interpretation of this number and what are the practical consequences?


COMPUTER EXERCISES 


10.3.14 Suppose we consider the simple normal linear regression to describe the re- 
lationship between the response Y (income) and the predictor X (investment) for the 
data in Example 10.3.9. 


(a) Plot the data in a scatter plot. 

(b) Calculate the least-squares line and plot this on the scatter plot in part (a). 

(c) Plot the standardized residuals against X. 

(d) Produce a normal probability plot of the standardized residuals. 

(e) What are your conclusions based on the plots produced in parts (c) and (d)? 

(f) If appropriate, calculate 0.95-confidence intervals for the intercept and slope.

(g) Do the results of your analysis allow you to conclude that there is a relationship 
between Y and X? Explain why or why not. 

(h) If the model is correct, what proportion of the observed variation in the response is 
explained by changes in the predictor? 




10.3.15 The following data are measurements of tensile strength (100 lb/in²) and hardness
(Rockwell E) on 20 pieces of die-cast aluminum.


Sample Strength Hardness | Sample Strength Hardness 




Suppose we consider the simple normal linear regression to describe the relationship 
between the response Y (strength) and the predictor X (hardness). 

(a) Plot the data in a scatter plot. 

(b) Calculate the least-squares line and plot this on the scatter plot in part (a). 

(c) Plot the standardized residuals against X. 

(d) Produce a normal probability plot of the standardized residuals. 

(e) What are your conclusions based on the plots produced in parts (c) and (d)? 

(f) If appropriate, calculate 0.95-confidence intervals for the intercept and slope. 

(g) Do the results of your analysis allow you to conclude that there is a relationship 
between Y and X? Explain why or why not. 

(h) If the model is correct, what proportion of the observed variation in the response is 
explained by changes in the predictor? 

10.3.16 Tests were carried out to determine the effect of gas inlet temperature (degrees 
Fahrenheit) and rotor speed (rpm) on the tar content (grains/cu ft) of a gas stream, 
producing the following data. 


Tar Speed Temperature 




Suppose we consider the normal linear regression model

$$Y \mid W = w, X = x \sim N(\beta_1 + \beta_2 w + \beta_3 x, \sigma^2)$$

to describe the relationship between $Y$ (tar content) and the predictors $W$ (rotor speed)
and $X$ (temperature).


(a) Plot the response in scatter plots against each predictor. 

(b) Calculate the least-squares equation. 

(c) Plot the standardized residuals against W and X. 

(d) Produce a normal probability plot of the standardized residuals. 

(e) What are your conclusions based on the plots produced in parts (c) and (d)? 

(f) If appropriate, calculate 0.95-confidence intervals for the regression coefficients. 


(g) Construct the ANOVA table to test whether or not there is a relationship between 
the response and the predictors. What is your conclusion? 


(h) If the model is correct, what proportion of the observed variation in the response is 
explained by changes in the predictors? 


(i) In an ANOVA table, assess the null hypothesis that there is no effect due to $W$, given
that $X$ is in the model.


(j) Estimate the mean of Y when W = 2750 and X = 50.0. If we consider this value 
as a prediction of a future Y at these settings, is this an extrapolation or interpolation? 


10.3.17 Suppose we consider the normal linear regression model

$$Y \mid X = x \sim N(\beta_1 + \beta_2 x + \beta_3 x^2, \sigma^2)$$

for the data of Exercise 10.3.5.

(a) Plot the response Y in a scatter plot against X. 

(b) Calculate the least-squares equation. 

(c) Plot the standardized residuals against X. 

(d) Produce a normal probability plot of the standardized residuals. 

(e) What are your conclusions based on the plots produced in parts (c) and (d)? 

(f) If appropriate, calculate 0.95-confidence intervals for the regression coefficients. 


(g) Construct the ANOVA table to test whether or not there is a relationship between 
the response and the predictor. What is your conclusion? 


(h) If the model is correct, what proportion of the observed variation in the response is 
explained by changes in the predictors? 


(i) In an ANOVA table, assess the null hypothesis that there is no effect due to $X^2$,
given that $X$ is in the model.


(j) Compare the predictions of Y at X = 6 using the simple linear regression model 
and using the linear model with a linear and quadratic term. 


PROBLEMS 


10.3.18 Suppose that $(x_1, \ldots, x_n)$ is a sample from the mixture distribution

$$0.5\,\text{Uniform}[0, 1] + 0.5\,\text{Uniform}[2, \theta],$$

where $\theta > 2$ is unknown. What is the least-squares estimate of the mean of this
distribution?




10.3.19 Consider the simple linear regression model and suppose that for the data collected,
we have $\sum_{i=1}^{n} (x_i - \bar{x})^2 = 0$. Explain how, and for which value of $x$, you
would estimate $E(Y \mid X = x)$.


10.3.20 For the simple linear regression model, under the assumptions of Theorem
10.3.3, establish that

$$\mathrm{Cov}(Y_i - B_1 - B_2 x_i, \; Y_j - B_1 - B_2 x_j \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 \delta_{ij} - \sigma^2 \left( \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{\sum_{k=1}^{n} (x_k - \bar{x})^2} \right),$$

where $\delta_{ij} = 1$ when $i = j$ and is 0 otherwise. (Hint: Use Theorems 3.3.2 and 10.3.3.)

10.3.21 Establish that (10.3.11) is distributed $N(0, 1)$ when $S$ is replaced by $\sigma$ in the
denominator. (Hint: Use Theorem 4.6.1 and Problem 10.3.20.)


10.3.22 (Prediction intervals) Under the assumptions of Theorem 10.3.6, prove that
the interval

$$b_1 + b_2 x \pm s \left( 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)^{1/2} t_{(1+\gamma)/2}(n - 2),$$

based on independent $(x_1, y_1), \ldots, (x_n, y_n)$, will contain $Y$ with probability equal to
$\gamma$ for a future independent $(X, Y)$ with $X = x$. (Hint: Theorems 4.6.1 and 3.3.2 and
Corollary 10.3.1.)

10.3.23 Consider the regression model with no intercept, given by $E(Y \mid X = x) =
\beta x$, where $\beta \in R^1$ is unknown. Suppose we observe the independent values $(x_1, y_1),
\ldots, (x_n, y_n)$.

(a) Determine the least-squares estimate of $\beta$.

(b) Prove that the least-squares estimate $b$ of $\beta$ is unbiased and, when $\mathrm{Var}(Y \mid X = x) =
\sigma^2$, prove that

$$\mathrm{Var}(B \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2}.$$

(c) Under the assumptions given in part (b), prove that

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - b x_i)^2$$

is an unbiased estimator of $\sigma^2$.

(d) Record an appropriate ANOVA decomposition for this model and a formula for $R^2$,
measuring the proportion of the variation observed in $Y$ due to changes in $X$.

(e) When $Y \mid X = x \sim N(\beta x, \sigma^2)$, and we observe the independent values $(x_1, y_1),
\ldots, (x_n, y_n)$, prove that $B \sim N(\beta, \sigma^2 / \sum_{i=1}^{n} x_i^2)$.

(f) Under the assumptions of part (e), and assuming that $(n - 1) S^2 / \sigma^2 \sim \chi^2(n - 1)$
independent of $B$ (this can be proved), indicate how you would test the null hypothesis
of no relationship between $Y$ and $X$.

(g) How would you define standardized residuals for this model and use them to check
model validity?

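10.3.24 For data $(x_1, y_1), \ldots, (x_n, y_n)$, prove that if $\alpha_1 = \beta_1 + \beta_2 \bar{x}$ and $\alpha_2 = \beta_2$,
then $\sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_i)^2$ equals

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\alpha_1 - \bar{y})^2 + \alpha_2^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 - 2 \alpha_2 \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$

From this, deduce that $\bar{y}$ and $a = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / \sum_{i=1}^{n} (x_i - \bar{x})^2$ are the least-squares
estimates of $\alpha_1$ and $\alpha_2$, respectively.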

10.3.25 For the model discussed in Section 10.3.3, prove that the prior given by $\alpha_1 \mid \alpha_2,
\sigma^2 \sim N(\mu_1, \tau_1^2 \sigma^2)$, $\alpha_2 \mid \sigma^2 \sim N(\mu_2, \tau_2^2 \sigma^2)$, and $1/\sigma^2 \sim \text{Gamma}(\kappa, \nu)$ leads to the
posterior distribution stated there. Conclude that this prior is conjugate with the posterior
distribution, as specified. (Hint: The development is similar to Example 7.1.4, as
detailed in Section 7.5.)

10.3.26 For the model specified in Section 10.3.3, prove that when $\tau_1 \to \infty$, $\tau_2 \to
\infty$, and $\nu \to 0$, the posterior distribution of $\alpha_1$ is given by the distribution of $\bar{y} +
(2\nu_{xy} / (n(2\kappa + n)))^{1/2} Z$, where $Z \sim t(2\kappa + n)$ and $\nu_{xy} = (c_y^2 - a^2 c_x^2)/2$.


CHALLENGES 


10.3.27 If $X_1, \ldots, X_n$ is a sample from a distribution with finite variance, then prove
that

$$\frac{X_i - \bar{X}}{\sqrt{\sum_{k=1}^n (X_k - \bar{X})^2}} \xrightarrow{\text{a.s.}} 0.$$


10.4 | Quantitative Response and Categorical 
Predictors 


In this section, we consider the situation in which the response is quantitative and the 
predictors are categorical. There can be many categorical predictors, but we restrict 
our discussion to at most two, as this gives the most important features of the general 
case. The general case is left to a further course. 


10.4.1 | One Categorical Predictor (One-Way ANOVA) 


Suppose now that the response Y is quantitative and the predictor X is categorical, 
taking a values or levels denoted 1, ...,a. With the regression model, we assume that 
the only aspect of the conditional distribution of Y, given X = x, that changes as x 
changes, is the mean. We let 

$$\beta_i = E(Y \mid X = i)$$


denote the mean response when the predictor X is at level i. Note that this is immedi- 
ately a linear regression model. 




We introduce the dummy variables

$$X_i = \begin{cases} 1 & X = i \\ 0 & X \neq i \end{cases}$$

for $i = 1, \ldots, a$. Notice that, whatever the value is of the response Y, only one of the
dummy variables takes the value 1, and the rest take the value 0. Accordingly, we can
write

$$E(Y \mid X_1 = x_1, \ldots, X_a = x_a) = \beta_1 x_1 + \cdots + \beta_a x_a,$$

because one and only one of the $x_i = 1$, whereas the rest are 0. This has exactly
the same form as the model discussed in Section 10.3.4, as the $X_i$ are quantitative.
As such, all the results of Section 10.3.4 immediately apply (we will restate relevant
results here).
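The dummy-variable coding can be sketched in a few lines of Python. This is only an illustration: the function names `dummy_variables` and `mean_response` and the numerical level means are our own hypothetical choices, not part of the text.

```python
def dummy_variables(level, a):
    """Return (x_1, ..., x_a) with x_i = 1 if level == i and x_i = 0 otherwise."""
    if not 1 <= level <= a:
        raise ValueError("level must be one of 1, ..., a")
    return [1 if i == level else 0 for i in range(1, a + 1)]


def mean_response(betas, x):
    """Evaluate beta_1 * x_1 + ... + beta_a * x_a."""
    return sum(b * xi for b, xi in zip(betas, x))


# Hypothetical level means for a predictor with a = 3 levels (illustration only).
betas = [10.0, 12.5, 9.0]
x = dummy_variables(2, a=3)
print(x, mean_response(betas, x))
```

Because exactly one dummy variable equals 1, the linear combination simply picks out the mean of the observed level.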


Inferences About Individual Means 


Now suppose that we observe $n_i$ values $(y_{i1}, \ldots, y_{in_i})$ when X = i, and all the
response values are independent. Note that we have $a$ independent samples. The
least-squares estimates of the $\beta_i$ are obtained by minimizing

$$\sum_{i=1}^a \sum_{j=1}^{n_i} (y_{ij} - \beta_i)^2.$$

The least-squares estimates are then equal to (see Problem 10.4.14)

$$b_i = \bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}.$$

These can be shown to be unbiased estimators of the $\beta_i$.

Assuming that the conditional distributions of Y, given X = x, all have variance
equal to $\sigma^2$, we have that the conditional variance of $\bar{Y}_i$ is given by $\sigma^2/n_i$, and the
conditional covariance between $\bar{Y}_i$ and $\bar{Y}_j$, when $i \neq j$, is 0. Furthermore, under these
conditions, an unbiased estimator of $\sigma^2$ is given by

$$s^2 = \frac{1}{N - a} \sum_{i=1}^a \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2,$$

where $N = n_1 + \cdots + n_a$.
If, in addition, we assume the normal linear regression model, namely,

$$Y \mid X = i \sim N(\beta_i, \sigma^2),$$

then $\bar{Y}_i \sim N(\beta_i, \sigma^2/n_i)$ independent of $(N - a)S^2/\sigma^2 \sim \chi^2(N - a)$. Therefore, by
Definition 4.6.2,

$$T = \frac{\bar{Y}_i - \beta_i}{S/\sqrt{n_i}} \sim t(N - a),$$

which leads to a $\gamma$-confidence interval of the form

$$\bar{y}_i \pm \frac{s}{\sqrt{n_i}} \, t_{(1+\gamma)/2}(N - a)$$


for $\beta_i$. Also, we can test the null hypothesis $H_0 : \beta_i = \beta_{i0}$ by computing the P-value

$$P\left(|T| \geq \frac{|\bar{y}_i - \beta_{i0}|}{s/\sqrt{n_i}}\right) = 2\left[1 - G\left(\frac{|\bar{y}_i - \beta_{i0}|}{s/\sqrt{n_i}}; N - a\right)\right],$$

where $G(\cdot\,; N - a)$ is the cdf of the $t(N - a)$ distribution. Note that these inferences
are just like those derived in Section 6.3 for the location-scale normal model, except
we now use a different estimator of $\sigma^2$ (with more degrees of freedom).
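The estimates and the confidence interval above can be sketched as follows. The data are hypothetical, the function names are our own, and the t quantile is passed in as an argument (here the tabulated value $t_{0.975}(6) \approx 2.4469$, which is an assumption not taken from the text) so that the sketch needs only the standard library.

```python
import math


def level_means_and_pooled_variance(groups):
    """groups[i] holds the responses observed at level i + 1.

    Returns the level means and the pooled estimate s^2, which has N - a
    degrees of freedom.
    """
    means = [sum(g) / len(g) for g in groups]
    n_total = sum(len(g) for g in groups)
    a = len(groups)
    ess = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return means, ess / (n_total - a)


def mean_confidence_interval(ybar_i, n_i, s2, t_quantile):
    """Interval ybar_i +/- (s / sqrt(n_i)) * t_quantile."""
    half_width = math.sqrt(s2 / n_i) * t_quantile
    return ybar_i - half_width, ybar_i + half_width


# Hypothetical responses at a = 3 levels (N = 9, so N - a = 6).
groups = [[4.0, 6.0, 5.0], [7.0, 9.0], [1.0, 2.0, 3.0, 2.0]]
means, s2 = level_means_and_pooled_variance(groups)
ci = mean_confidence_interval(means[0], len(groups[0]), s2, t_quantile=2.4469)
print(means, s2, ci)
```

Note that the interval for the first level uses all nine observations to estimate $\sigma^2$, even though only three of them enter $\bar{y}_1$.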


Inferences about Differences of Means and Two Sample 
Inferences 


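Often we want to make inferences about a difference of means $\beta_i - \beta_j$. Note that
$E(\bar{Y}_i - \bar{Y}_j) = \beta_i - \beta_j$ and

$$\mathrm{Var}(\bar{Y}_i - \bar{Y}_j) = \mathrm{Var}(\bar{Y}_i) + \mathrm{Var}(\bar{Y}_j) = \sigma^2 (1/n_i + 1/n_j)$$

because $\bar{Y}_i$ and $\bar{Y}_j$ are independent. By Theorem 4.6.1,

$$\bar{Y}_i - \bar{Y}_j \sim N(\beta_i - \beta_j, \sigma^2 (1/n_i + 1/n_j)).$$

Furthermore,

$$\frac{(\bar{Y}_i - \bar{Y}_j) - (\beta_i - \beta_j)}{\sigma (1/n_i + 1/n_j)^{1/2}} \sim N(0, 1)$$

independent of $(N - a)S^2/\sigma^2 \sim \chi^2(N - a)$. Therefore, by Definition 4.6.2,

$$T = \frac{(\bar{Y}_i - \bar{Y}_j) - (\beta_i - \beta_j)}{\sigma (1/n_i + 1/n_j)^{1/2}} \Big/ \left(\frac{(N - a)S^2}{(N - a)\sigma^2}\right)^{1/2}
= \frac{(\bar{Y}_i - \bar{Y}_j) - (\beta_i - \beta_j)}{S (1/n_i + 1/n_j)^{1/2}} \sim t(N - a). \qquad (10.4.1)$$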


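This leads to the $\gamma$-confidence interval

$$\bar{y}_i - \bar{y}_j \pm s \sqrt{\frac{1}{n_i} + \frac{1}{n_j}} \; t_{(1+\gamma)/2}(N - a)$$

for the difference of means $\beta_i - \beta_j$. We can test the null hypothesis $H_0 : \beta_i = \beta_j$, i.e.,
that the difference in the means equals 0, by computing the P-value

$$P\left(|T| \geq \frac{|\bar{y}_i - \bar{y}_j|}{s\sqrt{1/n_i + 1/n_j}}\right) = 2\left[1 - G\left(\frac{|\bar{y}_i - \bar{y}_j|}{s\sqrt{1/n_i + 1/n_j}}; N - a\right)\right].$$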




When a = 2, i.e., there are just two values for X, we refer to (10.4.1) as the
two-sample t-statistic, and the corresponding inference procedures are called the
two-sample t-confidence interval and the two-sample t-test for the difference of means. In
this case, if we conclude that $\beta_1 \neq \beta_2$, then we are saying that a relationship exists
between Y and X.
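For the case a = 2, the two-sample t-statistic with $\beta_1 - \beta_2 = 0$ can be computed as in the following sketch. The two samples are hypothetical and the function name `two_sample_t` is our own; the P-value would then be obtained from the cdf of the $t(n_1 + n_2 - 2)$ distribution using tables or software.

```python
import math


def two_sample_t(sample1, sample2):
    """Return (t, df) for testing equality of the two means, using the
    pooled estimate s^2 with df = n1 + n2 - 2."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    ess = sum((y - m1) ** 2 for y in sample1) + sum((y - m2) ** 2 for y in sample2)
    df = n1 + n2 - 2
    s = math.sqrt(ess / df)
    t = (m1 - m2) / (s * math.sqrt(1 / n1 + 1 / n2))
    return t, df


# Hypothetical samples (illustration only).
t_stat, df = two_sample_t([5.0, 7.0, 9.0], [1.0, 2.0, 3.0, 2.0])
print(t_stat, df)
```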


The ANOVA for Assessing a Relationship with the Predictor 


Suppose, in the general case when a > 2, we are interested in assessing whether or not
there is a relationship between the response and the predictor. There is no relationship
if and only if all the conditional distributions are the same; this is true, under our
assumptions, if and only if $\beta_1 = \cdots = \beta_a$, i.e., if and only if all the means are equal.
So testing the null hypothesis that there is no relationship between the response and the
predictor is equivalent to testing the null hypothesis $H_0 : \beta_1 = \cdots = \beta_a = \beta$ for some
unknown $\beta$.

If the null hypothesis is true, the least-squares estimate of $\beta$ is given by $\bar{y}$, the
overall average response value. In this case, we have that the total variation decomposes
as (see Problem 10.4.15)

$$\sum_{i=1}^a \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 = \sum_{i=1}^a n_i (\bar{y}_i - \bar{y})^2 + \sum_{i=1}^a \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2,$$

and so the relevant ANOVA table for testing $H_0$ is given below.


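Source |  Df     Sum of Squares                                        Mean Square
X      |  a - 1  $\sum_{i=1}^a n_i (\bar{y}_i - \bar{y})^2$                      $\sum_{i=1}^a n_i (\bar{y}_i - \bar{y})^2 / (a - 1)$
Error  |  N - a  $\sum_{i=1}^a \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$          $s^2$
Total  |  N - 1  $\sum_{i=1}^a \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$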


To assess $H_0$, we use the F-statistic

$$F = \frac{\sum_{i=1}^a n_i (\bar{y}_i - \bar{y})^2 / (a - 1)}{s^2}$$

because, under the null hypothesis, both the numerator and the denominator are
unbiased estimators of $\sigma^2$. When the null hypothesis is false, the numerator tends to be
larger than $\sigma^2$. When we add the normality assumption, we have that $F \sim F(a - 1, N - a)$,
and so we compute the P-value

$$P\left(F > \frac{\sum_{i=1}^a n_i (\bar{y}_i - \bar{y})^2 / (a - 1)}{s^2}\right)$$

to assess whether the observed value of F is so large as to be surprising. Note that
when a = 2, this P-value equals the P-value obtained via the two-sample t-test.
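The ANOVA decomposition and the F-statistic can be computed directly from grouped data, as in the following sketch. The data are hypothetical and the function name `one_way_anova` is our own; the P-value itself requires the cdf of the $F(a - 1, N - a)$ distribution, which we leave to statistical software.

```python
def one_way_anova(groups):
    """groups[i] holds the responses at level i + 1.

    Returns the degrees of freedom, sums of squares, and F-statistic of the
    one-way ANOVA table.
    """
    n_total = sum(len(g) for g in groups)
    a = len(groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_x = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_error = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    ss_total = sum((y - grand) ** 2 for g in groups for y in g)
    s2 = ss_error / (n_total - a)
    return {
        "df": (a - 1, n_total - a, n_total - 1),
        "ss": (ss_x, ss_error, ss_total),
        "s2": s2,
        "F": (ss_x / (a - 1)) / s2,
    }


# Hypothetical responses at a = 3 levels (illustration only).
table = one_way_anova([[4.0, 6.0, 5.0], [7.0, 9.0], [1.0, 2.0, 3.0, 2.0]])
print(table)
```

The first two sums of squares add up to the third, which is the decomposition of the total variation recorded above.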




Multiple Comparisons 


If we reject the null hypothesis of no differences among the means, then we want to 
see where the differences exist. For this, we use inference methods based on (10.4.1). 
Of course, we have to worry about the problem of multiple comparisons, as discussed 
in Section 9.3. Recall that this problem arises whenever we are testing many null 
hypotheses using a specific critical value, such as 5%, as a cutoff for a P-value, to 
decide whether or not a difference exists. The cutoff value for an individual P-value 
is referred to as the individual error rate. In effect, even if no differences exist, the 
probability of concluding that at least one difference exists, the family error rate, can 
be quite high. 

There are a number of procedures designed to control the family error rate when 
making multiple comparisons. The simplest is to lower the individual error rate, as 
the family error rate is typically an increasing function of this quantity. This is the 
approach we adopt here, and we rely on statistical software to compute and report the 
family error rate for us. We refer to this procedure as Fisher’s multiple comparison 
test. 


Model Checking 


To check the model, we look at the standardized residuals (see Problem 10.4.17) given
by

$$\frac{y_{ij} - \bar{y}_i}{s \sqrt{1 - 1/n_i}}. \qquad (10.4.2)$$

We will restrict our attention to various plots of the standardized residuals for model
checking.
We now consider an example.


EXAMPLE 10.4.1
A study was undertaken to determine whether or not eight different types of fat are
absorbed in different amounts during the cooking of donuts. Results were collected
based on cooking six different donuts and then measuring the amount of fat in grams
absorbed. We take the variable X to be the type of fat and use the model of this section.
The collected data are presented in the following table.


A normal probability plot of the standardized residuals is provided in Figure 10.4.1. 
A plot of the standardized residuals against type of fat is provided in Figure 10.4.2. 




Neither plot gives us significant grounds for concern over the validity of the model, 
although there is some indication of a difference in the variability of the response as 
the type of fat changes. Another useful plot in this situation is a side-by-side boxplot, 
as it shows graphically where potential differences may lie. Such a plot is provided in 
Figure 10.4.3. 

The following table gives the mean amounts of each fat absorbed. 


Fat 1 Fat 2 Fat 3 Fat 4 Fat 5 Fat 6 Fat 7 Fat 8 
172.00 177.83 182.17 184.50 165.50 176.33 161.33 162.33 


The grand mean response is given by 172.8. 


Figure 10.4.1: Normal probability plot of the standardized residuals in Example 10.4.1.


Figure 10.4.2: Standardized residuals versus type of fat in Example 10.4.1.


Figure 10.4.3: Side-by-side boxplots of the response versus type of fat in Example 10.4.1.


To assess the null hypothesis of no differences among the types of fat, we calculate 
the following ANOVA table. 


Source |  Df   Sum of Squares   Mean Square
X      |   7                    478
Error  |  40                    145
Total  |  47


Then we use the F-statistic given by F = 478/145 = 3.3. Because $F \sim F(7, 40)$
under $H_0$, we obtain the P-value P(F > 3.3) = 0.007. Therefore, we conclude that
there is a difference among the fat types at the 0.05 level.
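As a check on these numbers, the mean square for X can be recovered from the table of means alone. The sketch below assumes six donuts per fat type ($n_i = 6$), which is our reading of the study description and is consistent with the 40 error degrees of freedom; the error mean square 145 is taken from the text because it cannot be recomputed without the raw data.

```python
# Mean fat absorbed for Fat 1, ..., Fat 8, as reported in the table of means.
means = [172.00, 177.83, 182.17, 184.50, 165.50, 176.33, 161.33, 162.33]
n_per_fat = 6      # assumption: six donuts for each of the eight fats
ms_error = 145.0   # error mean square quoted in the text

a = len(means)
# With equal n_i, the grand mean is the average of the level means.
grand = sum(means) / a
ss_x = n_per_fat * sum((m - grand) ** 2 for m in means)
ms_x = ss_x / (a - 1)
f_stat = ms_x / ms_error
print(round(ms_x), round(f_stat, 1))
```

Up to the rounding of the reported means, this reproduces the numerator 478 and the value F = 3.3 used above.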

To ascertain where the differences exist, we look at all pairwise differences. There 
are $8 \cdot 7/2 = 28$ such comparisons. If we use the 0.05 level to determine whether
or not a difference among means exists, then software computes the family error rate 
as 0.481, which seems uncomfortably high. When we use the 0.01 level, the family 
error rate falls to 0.151. With the individual error rate at 0.003, the family error rate is 
0.0546. Using the individual error rate of 0.003, the only differences detected among 
the means are those between Fat 4 and Fat 7, and Fat 4 and Fat 8. Note that Fat 4 has 
the highest absorption whereas Fats 7 and 8 have the lowest absorptions. 

Overall, the results are somewhat inconclusive, as we see some evidence of dif- 
ferences existing, but we are left with some anomalies as well. For example, Fats 4 
and 5 are not different and neither are Fats 7 and 5, but Fats 4 and 7 are deemed to be 
different. To resolve such conflicts requires either larger sample sizes or a more refined 
experiment so that the comparisons are more accurate.




10.4.2 | Repeated Measures (Paired Comparisons) 


Consider k quantitative variables $Y_1, \ldots, Y_k$ defined on a population $\Pi$. Suppose that
our purpose is to compare the distributions of these variables. Typically, these will be 
similar variables, all measured in the same units. 


EXAMPLE 10.4.2 

Suppose that $\Pi$ is a set of students enrolled in a first-year program requiring students
to take both calculus and physics, and we want to compare the marks achieved in these
subjects. If we let $Y_1$ denote the calculus grade and $Y_2$ denote the physics grade, then
we want to compare the distributions of these variables.


EXAMPLE 10.4.3 

Suppose we want to compare the distributions of the duration of headaches for two
treatments (A and B) in a population of migraine headache sufferers. We let $Y_1$ denote
the duration of a headache after being administered treatment A, and let $Y_2$ denote the
duration of a headache after being administered treatment B.


The repeated-measures approach to the problem of comparing the distributions of
$Y_1, \ldots, Y_k$ involves taking a random sample $\pi_1, \ldots, \pi_n$ from $\Pi$ and, for each $\pi_i$,
obtaining the k-dimensional value $(Y_1(\pi_i), \ldots, Y_k(\pi_i)) = (y_{i1}, \ldots, y_{ik})$. This gives
a sample of n from a k-dimensional distribution. Obviously, this is called repeated
measures because we are taking the measurements $Y_1(\pi_i), \ldots, Y_k(\pi_i)$ on the same
$\pi_i$.

An alternative to repeated measures is to take k independent samples from $\Pi$
and, for each of these samples, to obtain the values of one and only one of the
variables $Y_i$. There is an important reason why the repeated-measures approach is
preferred: We expect less variation in the values of differences, like $Y_i - Y_j$, under
repeated-measures sampling, than we do under independent sampling because the
values $Y_1(\pi), \ldots, Y_k(\pi)$ are being taken on the same member of the population in
repeated measures.

To see this more clearly, suppose all of the variances and covariances exist for the
joint distribution of $Y_1, \ldots, Y_k$. This implies that

$$\mathrm{Var}(Y_i - Y_j) = \mathrm{Var}(Y_i) + \mathrm{Var}(Y_j) - 2\,\mathrm{Cov}(Y_i, Y_j). \qquad (10.4.3)$$


Because $Y_i$ and $Y_j$ are similar variables, being measured on the same individual, we
expect them to be positively correlated. Now with independent sampling, we have that
$\mathrm{Var}(Y_i - Y_j) = \mathrm{Var}(Y_i) + \mathrm{Var}(Y_j)$, so the variances of differences should be smaller
with repeated measures than with independent sampling.
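To get a sense of the size of this reduction, consider the following worked case. It rests on two simplifying assumptions of our own, not made in the text: both variables have a common variance, written here as $\sigma_Y^2$, and their correlation under repeated measures is $\rho$. Then (10.4.3) gives

$$\mathrm{Var}(Y_i - Y_j) = \sigma_Y^2 + \sigma_Y^2 - 2\rho\,\sigma_Y^2 = 2\sigma_Y^2 (1 - \rho),$$

whereas independent sampling gives $2\sigma_Y^2$. For instance, with a hypothetical correlation of $\rho = 0.8$, the variance of a difference under repeated measures is $0.4\,\sigma_Y^2$, one-fifth of the value $2\sigma_Y^2$ obtained under independent sampling.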

When we assume that the distributions of the $Y_i$ differ at most in their means, then it
makes sense to make inferences about the differences of the population means $\mu_i - \mu_j$,
using the differences of the sample means $\bar{y}_i - \bar{y}_j$. In the repeated-measures context,
we can write

$$\bar{y}_i - \bar{y}_j = \frac{1}{n} \sum_{l=1}^n (y_{li} - y_{lj}).$$

The individual components of this sum are independent, and so

$$\mathrm{Var}(\bar{Y}_i - \bar{Y}_j) = \frac{\mathrm{Var}(Y_i) + \mathrm{Var}(Y_j) - 2\,\mathrm{Cov}(Y_i, Y_j)}{n}.$$

We can consider the differences $d_1 = y_{1i} - y_{1j}, \ldots, d_n = y_{ni} - y_{nj}$ to be a sample
of n from a one-dimensional distribution with mean $\mu_i - \mu_j$ and variance $\sigma^2$ given by
(10.4.3). Accordingly, we estimate $\mu_i - \mu_j$ by $\bar{d} = \bar{y}_i - \bar{y}_j$ and estimate $\sigma^2$ by

$$s^2 = \frac{1}{n - 1} \sum_{l=1}^n (d_l - \bar{d})^2. \qquad (10.4.4)$$


If we assume that the joint distribution of $Y_1, \ldots, Y_k$ is multivariate normal (this
means that any linear combination of these variables is normally distributed; see
Problem 9.1.18), then this forces the distribution of $Y_i - Y_j$ to be $N(\mu_i - \mu_j, \sigma^2)$.
Accordingly, we have all the univariate techniques discussed in Chapter 6 for inferences
about $\mu_i - \mu_j$.

The discussion so far has been about whether the distributions of variables differed. 
Assuming these distributions differ at most in their means, this leads to a comparison of 
the means. We can, however, record an observation as (X, Y), where X takes values
in $\{1, \ldots, k\}$ and X = i means that $Y = Y_i$. Then the conditional distribution of Y
given X = i is the same as the distribution of $Y_i$. Therefore, if we conclude that the
distributions of the $Y_i$ are different, we can conclude that a relationship exists between
Y and X. In Example 10.4.2, this means that a relationship exists between a student’s 
grade and whether or not the grade was in calculus or physics. In Example 10.4.3, this 
means that a relationship exists between length of a headache and the treatment. 

When can we assert that such a relationship is in fact a cause-effect relationship? 
Applying the discussion in Section 10.1.2, we know that we have to be able to assign 
the value of X to a randomly selected element of the population. In Example 10.4.2, 
we see this is impossible, so we cannot assert that such a relationship is a cause-effect 
relationship. In Example 10.4.3, however, we can indeed do this — namely, for a 
randomly selected individual, we randomly assign a treatment to the first headache 
experienced during the study period and then apply the other treatment to the second 
headache experienced during the study period. 

A full discussion of repeated measures requires more advanced concepts in statis- 
tics. We restrict our attention now to the presentation of an example when k = 2, 
which is commonly referred to as paired comparisons. 


EXAMPLE 10.4.4 Blood Pressure Study 

The following table came from a study of the effect of the drug captopril on blood 
pressure, as reported in Applied Statistics, Principles and Examples by D. R. Cox and 
E. J. Snell (Chapman and Hall, London, 1981). Each measurement is the difference in 
the systolic blood pressure before and after having been administered the drug. 


 -9    -4   -21    -3   -20
-31   -17   -26   -26   -10
-23   -33   -19   -19   -23




Figure 10.4.4 is a normal probability plot for these data and, because this looks rea- 
sonable, we conclude that the inference methods based on the assumption of normality 
are acceptable. Note that here we have not standardized the variable first, so we are 
only looking to see if the plot is reasonably straight. 


Figure 10.4.4: Normal probability plot for the data in Example 10.4.4 (normal score
plotted against blood pressure difference).


The mean difference is given by $\bar{d} = -18.93$ with standard deviation $s = 9.03$.
Accordingly, the standard error of the estimate of the difference in the means, using
(10.4.4), is given by $s/\sqrt{15} = 2.33$. A 0.95-confidence interval for the difference in
the mean systolic blood pressure, before and after being administered captopril, is then

$$\bar{d} \pm \frac{s}{\sqrt{n}} \, t_{0.975}(n - 1) = -18.93 \pm 2.33 \, t_{0.975}(14) = (-23.93, -13.93).$$

Because this does not include 0, we reject the null hypothesis of no difference in the
means at the 0.05 level. The actual P-value for the two-sided test is given by

$$P(|T| \geq |-18.93/2.33|) = 0.000$$

because $T \sim t(14)$ under the null hypothesis $H_0$ that the means are equal. Therefore,
we have strong evidence against $H_0$. It seems that we have strong evidence that the
drug is leading to a drop in blood pressure.
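The summary values in this example can be reproduced with a short calculation. In the sketch below, the fifteen differences are entered as we read them from the table, the function name `paired_summary` is our own, and the quantile $t_{0.975}(14) \approx 2.1448$ is a tabulated value that we supply as an assumption because the text does not quote it.

```python
import math

# Differences in systolic blood pressure, as read from the table of Example 10.4.4.
diffs = [-9, -4, -21, -3, -20,
         -31, -17, -26, -26, -10,
         -23, -33, -19, -19, -23]

T_975_14 = 2.1448  # assumed tabulated 0.975 quantile of the t(14) distribution


def paired_summary(d, t_quantile):
    """Return the mean, standard deviation (10.4.4), standard error,
    confidence interval, and t-statistic for a sample of differences."""
    n = len(d)
    dbar = sum(d) / n
    s = math.sqrt(sum((x - dbar) ** 2 for x in d) / (n - 1))
    se = s / math.sqrt(n)
    half_width = se * t_quantile
    return dbar, s, se, (dbar - half_width, dbar + half_width), dbar / se


dbar, s, se, ci, t_stat = paired_summary(diffs, T_975_14)
print(round(dbar, 2), round(s, 2), round(se, 2),
      tuple(round(c, 2) for c in ci), round(t_stat, 2))
```

The observed t-statistic is far out in the tail of the t(14) distribution, which is why the P-value is reported as 0.000 to three decimals.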


10.4.3 | Two Categorical Predictors (Two-Way ANOVA) 


Now suppose that we have a single quantitative response Y and two categorical pre- 
dictors A and B, where A takes a levels and B takes b levels. One possibility is to 
consider running two one-factor studies. One study will examine the relationship be- 
tween Y and A, and the second study will examine the relationship between Y and B. 
There are several disadvantages to such an approach, however. 

First, and perhaps foremost, doing two separate analyses will not allow us to
determine the joint relationship A and B have with Y. This relates directly to the concept
of interaction between predictors. We will soon define this concept more precisely,
but basically, if A and B interact, then the conditional relationship between Y and A,
given B = j, changes in some substantive way as we change j. If the predictors A
and B do not interact, then indeed we will be able to examine the relationship between
the response and each of the predictors separately. But we almost never know that this
is the case beforehand and must assess whether or not an interaction exists based on
collected data.

A second reason for including both predictors in the analysis is that this will often 
lead to a reduction in the contribution of random error to the results. By this, we mean 
that we will be able to explain some of the observed variation in Y by the inclusion 
of the second variable in the model. This depends, however, on the additional variable 
having a relationship with the response. Furthermore, for the inclusion of a second 
variable to be worthwhile, this relationship must be strong enough to justify the loss in 
degrees of freedom available for the estimation of the contribution of random error to 
the experimental results. As we will see, including the second variable in the analysis 
results in a reduction in the degrees of freedom in the Error row of the ANOVA table. 
Degrees of freedom are playing the role of sample size here. The fewer the degrees of 
freedom in the Error row, the less accurate our estimate of $\sigma^2$ will be.

When we include both predictors in our analysis, and we have the opportunity to 
determine the sampling process, it is important that we cross the predictors. By this, 
we mean that we observe Y at each combination 


$$(A, B) = (i, j) \in \{1, \ldots, a\} \times \{1, \ldots, b\}.$$


Suppose, then, that we have $n_{ij}$ response values at the (A, B) = (i, j) setting of the
predictors. Then, letting

$$E(Y \mid (A, B) = (i, j)) = \beta_{ij}$$

be the mean response when A = i and B = j, and introducing the dummy variables

$$X_{ij} = \begin{cases} 1 & A = i, \ B = j \\ 0 & A \neq i \text{ or } B \neq j, \end{cases}$$

we can write

$$E(Y \mid X_{ij} = x_{ij} \text{ for all } i, j) = \beta_{11} x_{11} + \beta_{21} x_{21} + \cdots + \beta_{ab} x_{ab} = \sum_{i=1}^a \sum_{j=1}^b \beta_{ij} x_{ij}.$$

The relationship between Y and the predictors is completely encompassed in the changes
in the $\beta_{ij}$ as i and j change. From this, we can see that a regression model for this
situation is immediately a linear regression model.




Inferences About Individual Means and Differences of Means 


Now let $y_{ijk}$ denote the kth response value when $X_{ij} = 1$. Then, as in Section 10.4.1,
the least-squares estimate of $\beta_{ij}$ is given by

$$b_{ij} = \bar{y}_{ij} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} y_{ijk},$$

the mean of the observations when $X_{ij} = 1$. If in addition we assume that the
conditional distributions of Y, given the predictors, all have variance equal to $\sigma^2$, then with
$N = n_{11} + n_{21} + \cdots + n_{ab}$, we have that

$$s^2 = \frac{1}{N - ab} \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{ij})^2 \qquad (10.4.5)$$

is an unbiased estimator of $\sigma^2$. Therefore, using (10.4.5), the standard error of $\bar{y}_{ij}$ is
given by $s/\sqrt{n_{ij}}$.
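With the normality assumption, we have that $\bar{Y}_{ij} \sim N(\beta_{ij}, \sigma^2/n_{ij})$, independent
of

$$\frac{(N - ab) S^2}{\sigma^2} \sim \chi^2(N - ab).$$

This leads to the $\gamma$-confidence intervals

$$\bar{y}_{ij} \pm \frac{s}{\sqrt{n_{ij}}} \, t_{(1+\gamma)/2}(N - ab)$$

for $\beta_{ij}$ and

$$\bar{y}_{ij} - \bar{y}_{kl} \pm s \sqrt{\frac{1}{n_{ij}} + \frac{1}{n_{kl}}} \; t_{(1+\gamma)/2}(N - ab)$$

for the difference of means $\beta_{ij} - \beta_{kl}$.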


The ANOVA for Assessing Interaction and Relationships with 
the Predictors 


We are interested in whether or not there is any relationship between Y and the
predictors. There is no relationship between the response and the predictors if and only
if all the $\beta_{ij}$ are equal. Before testing this, however, it is customary to test the null
hypothesis that there is no interaction between the predictors. The precise definition of
no interaction here is that

$$\beta_{ij} = \mu_i + v_j$$

for all i and j for some constants $\mu_i$ and $v_j$, i.e., the means can be expressed additively.
Note that if we fix B = j and let A vary, then these response curves (a response curve
is a plot of the means of one variable while holding the value of the second variable
fixed) are all parallel. This is an equivalent way of saying that there is no interaction
between the predictors.
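Additivity of a table of means can be checked numerically: $\beta_{ij} = \mu_i + v_j$ holds for all i and j exactly when every contrast $\beta_{ij} - \beta_{i1} - \beta_{1j} + \beta_{11}$ equals 0, which is the algebraic form of parallel response curves. The following sketch uses two hypothetical tables of means with a = 3 and b = 2; the function name `is_additive` is our own.

```python
def is_additive(beta, tol=1e-12):
    """beta[i][j] is the mean response at A = i + 1, B = j + 1.

    Returns True when beta_ij - beta_i1 - beta_1j + beta_11 = 0 for all i, j,
    i.e., when the means can be written as mu_i + v_j.
    """
    return all(
        abs(beta[i][j] - beta[i][0] - beta[0][j] + beta[0][0]) <= tol
        for i in range(len(beta))
        for j in range(len(beta[0]))
    )


# Hypothetical means: rows are the levels of A, columns the levels of B.
parallel = [[1.0, 3.0], [2.0, 4.0], [4.5, 6.5]]   # curves differ by a constant shift
crossing = [[1.0, 3.0], [2.0, 2.5], [4.5, 6.5]]   # the shift changes with the level of A
print(is_additive(parallel), is_additive(crossing))
```

With real data, the estimated means will essentially never be exactly additive, which is why a formal test of the no-interaction hypothesis is needed.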




In Figure 10.4.5, we have depicted response curves in which the factors do not
interact, and in Figure 10.4.6 we have depicted response curves in which they do. Note
that the solid lines, for example, joining $\beta_{11}$ and $\beta_{21}$, are there just to make it easier to
display the parallelism (or lack thereof) and have no other significance.


Figure 10.4.5: Response curves for expected response with two predictors, with A taking three
levels and B taking two levels. Because they are parallel, the predictors do not interact.


Figure 10.4.6: Response curves for expected response with two predictors, with A taking three
levels and B taking two levels. They are not parallel, so the predictors interact.


To test the null hypothesis of no interaction, we must first fit the model where
$\beta_{ij} = \mu_i + v_j$, i.e., find the least-squares estimates of the $\beta_{ij}$ under these constraints.
We will not pursue the mathematics of obtaining these estimates here, but rely on
software to do this for us and to compute the sum of squares relevant for testing the
null hypothesis of no interaction (from the results of Section 10.3.4, we know that this
is obtained by differencing the regression sum of squares obtained from the full model
and the regression sum of squares obtained from the model with no interaction).

If we decide that an interaction exists, then it is immediate that both A and B have
an effect on Y (if A does not have an effect, then A and B cannot interact; see
Problem 10.4.16); we must look at differences among the $\bar{y}_{ij}$ to determine the form
of the relationship. If we decide that no interaction exists, then A has an effect if and
only if the $\mu_i$ vary, and B has an effect if and only if the $v_j$ vary. We can test the
null hypothesis $H_0 : \mu_1 = \cdots = \mu_a$ of no effect due to A and the null hypothesis
$H_0 : v_1 = \cdots = v_b$ of no effect due to B separately, once we have decided that no
interaction exists.

The details for deriving the relevant sums of squares for all these hypotheses are 
not covered here, but many statistical packages will produce an ANOVA table, as given 
below. 


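Source |  Df               Sum of Squares
A      |  a - 1            RSS(A)
B      |  b - 1            RSS(B)
A × B  |  (a - 1)(b - 1)   RSS(A × B)
Error  |  N - ab           $\sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y}_{ij})^2$
Total  |  N - 1            $\sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^{n_{ij}} (y_{ijk} - \bar{y})^2$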

Note that if we had included only A in the model, then there would be N - a degrees
of freedom for the estimation of $\sigma^2$. By including B, we lose (N - a) - (N - ab) =
a(b - 1) degrees of freedom for the estimation of $\sigma^2$.

Using this table, we first assess the null hypothesis $H_0$ : no interaction between A
and B, using $F \sim F((a - 1)(b - 1), N - ab)$ under $H_0$, via the P-value

$$P\left(F > \frac{\mathrm{RSS}(A \times B) / ((a - 1)(b - 1))}{s^2}\right),$$

where $s^2$ is given by (10.4.5). If we decide that no interaction exists, then we assess
the null hypothesis $H_0$ : no effect due to A, using $F \sim F(a - 1, N - ab)$ under $H_0$,
via the P-value

$$P\left(F > \frac{\mathrm{RSS}(A) / (a - 1)}{s^2}\right),$$

and assess $H_0$ : no effect due to B, using $F \sim F(b - 1, N - ab)$ under $H_0$, via the
P-value

$$P\left(F > \frac{\mathrm{RSS}(B) / (b - 1)}{s^2}\right).$$

Model Checking 


To check the model, we look at the standardized residuals given by (see Problem
10.4.18)

$$\frac{y_{ijk} - \bar{y}_{ij}}{s \sqrt{1 - 1/n_{ij}}}. \qquad (10.4.6)$$




We will restrict our attention to various plots of the standardized residuals for model 
checking. 
We consider an example of a two-factor analysis. 


EXAMPLE 10.4.5 

The data in the following table come from G. E. P. Box and D. R. Cox, “An analysis of 
transformations” (Journal of the Royal Statistical Society, 1964, Series B, p. 211) and 
represent survival times, in hours, of animals exposed to one of three different types of 
poisons and allocated four different types of treatments. We let A denote the treatments 
and B denote the type of poison, so we have 3 × 4 = 12 different (A, B) combinations.
Each combination was administered to four different animals; i.e., $n_{ij} = 4$ for every i
and j.


     |  A1                    A2                     A3                    A4
B1   |  3.1, 4.5, 4.6, 4.3    8.2, 11.0, 8.8, 7.2    4.3, 4.5, 6.3, 7.6    4.5, 7.1, 6.6, 6.2
B2   |  3.6, 2.9, 4.0, 2.3    9.2, 6.1, 4.9, 12.4    4.4, 3.5, 3.1, 4.0    5.6, 10.2, 7.1, 3.8
B3   |  2.2, 2.1, 1.8, 2.3    3.0, 3.7, 3.8, 2.9     2.3, 2.5, 2.4, 2.2    3.0, 3.6, 3.1, 3.3


A normal probability plot for these data, using the standardized residuals after
fitting the two-factor model, reveals a definite problem. In the above reference, a
transformation of the response to the reciprocal 1/Y is suggested, based on a more
sophisticated analysis, and this indeed leads to much more appropriate standardized residual
plots. Figure 10.4.7 is a normal probability plot for the standardized residuals based on
the reciprocal response. This normal probability plot looks reasonable.


Figure 10.4.7: Normal probability plot of the standardized residuals in Example 10.4.5 using
the reciprocal of the response.


Figure 10.4.8 is a plot of the standardized residuals against the various (A, B)
combinations, where we have coded the combination (i, j) as b(i - 1) + j with
b = 3, i = 1, 2, 3, 4, and j = 1, 2, 3. This coding assigns a unique integer to each
combination (i, j) and is convenient when comparing scatter plots of the response for
each treatment. Again, this residual plot looks reasonable.


Figure 10.4.8: Scatter plot for the data in Example 10.4.5 of the standardized residuals against
each value of (A, B) using the reciprocal of the response.


Below we provide the least-squares estimates of the $\beta_{ij}$ for the transformed model.


     |  A1        A2        A3        A4
B1   |  0.24869   0.11635   0.18627   0.16897
B2   |  0.32685   0.13934   0.27139   0.17015
B3   |  0.48027   0.30290   0.42650   0.30918


The ANOVA table for the data, as obtained from a standard statistical package, is given
below.

Source |  Df   Sum of Squares   Mean Square
A      |   3   0.20414          0.06805
B      |   2   0.34877          0.17439
A × B  |   6   0.01571          0.00262
Error  |  36   0.08643          0.00240
Total  |  47   0.65505


From this, we determine that $s = \sqrt{0.00240} = 4.89898 \times 10^{-2}$, and so the standard
errors of the least-squares estimates are all equal to s/2 = 0.0244949.

To test the null hypothesis of no interaction between A and B, we have, using
$F \sim F(6, 36)$ under $H_0$, the P-value

$$P\left(F > \frac{0.00262}{0.00240}\right) = P(F > 1.09) = 0.387.$$

We have no evidence against the null hypothesis.
So we can go on to test the null hypothesis of no effect due to A and we have, using
$F \sim F(3, 36)$ under $H_0$ (A has four levels, so the numerator has 3 degrees of freedom),
the P-value

$$P\left(F > \frac{0.06805}{0.00240}\right) = P(F > 28.35) = 0.000.$$




We reject this null hypothesis.
Similarly, testing the null hypothesis of no effect due to B, we have, using
$F \sim F(2, 36)$ under $H_0$, the P-value

$$P\left(F > \frac{0.17439}{0.00240}\right) = P(F > 72.66) = 0.000.$$

We reject this null hypothesis as well.
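The sums of squares and F-statistics used in these tests can be reproduced from the raw data. The sketch below is not the package computation referred to in the text; it relies on the standard closed-form expressions for a balanced design (equal $n_{ij}$), which we assume here. The survival times are entered as we read them from the data table, with the last value in cell (A3, B1) taken as 7.6, the value consistent with the reported estimate 0.18627 for that cell.

```python
# times[j][i]: survival times (hours) for poison B(j+1) and treatment A(i+1).
times = [
    [[3.1, 4.5, 4.6, 4.3], [8.2, 11.0, 8.8, 7.2], [4.3, 4.5, 6.3, 7.6], [4.5, 7.1, 6.6, 6.2]],
    [[3.6, 2.9, 4.0, 2.3], [9.2, 6.1, 4.9, 12.4], [4.4, 3.5, 3.1, 4.0], [5.6, 10.2, 7.1, 3.8]],
    [[2.2, 2.1, 1.8, 2.3], [3.0, 3.7, 3.8, 2.9], [2.3, 2.5, 2.4, 2.2], [3.0, 3.6, 3.1, 3.3]],
]
b, a, n = 3, 4, 4  # levels of B, levels of A, observations per cell

# Transform the response to 1/Y and compute the cell means (estimates of beta_ij).
recip = [[[1.0 / y for y in cell] for cell in row] for row in times]
cell_mean = [[sum(cell) / n for cell in row] for row in recip]
grand = sum(sum(row) for row in cell_mean) / (a * b)
mean_A = [sum(cell_mean[j][i] for j in range(b)) / b for i in range(a)]
mean_B = [sum(cell_mean[j]) / a for j in range(b)]

# Balanced-design sums of squares.
rss_A = n * b * sum((m - grand) ** 2 for m in mean_A)
rss_B = n * a * sum((m - grand) ** 2 for m in mean_B)
rss_AB = n * sum(
    (cell_mean[j][i] - mean_A[i] - mean_B[j] + grand) ** 2
    for j in range(b) for i in range(a)
)
ess = sum(
    (r - cell_mean[j][i]) ** 2
    for j in range(b) for i in range(a) for r in recip[j][i]
)

s2 = ess / (n * a * b - a * b)               # error mean square, 36 df
f_AB = (rss_AB / ((a - 1) * (b - 1))) / s2   # interaction
f_A = (rss_A / (a - 1)) / s2                 # effect of A
f_B = (rss_B / (b - 1)) / s2                 # effect of B
print(rss_A, rss_B, rss_AB, ess)
print(f_AB, f_A, f_B)
```

The F-statistics agree with those in the text up to the rounding of the mean squares; the P-values themselves require the cdf of the relevant F distribution.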

Accordingly, we have decided that the appropriate model is the additive model
given by $E(1/Y \mid (A, B) = (i, j)) = \mu_i + v_j$ (we are still using the transformed
response 1/Y). We can also write this as $E(1/Y \mid (A, B) = (i, j)) = (\mu_i + a) +
(v_j - a)$ for any choice of $a$. Therefore, there is no unique estimate of the additive
effects due to A or B. However, we still have unique least-squares estimates of the
means, which are obtained (using software) by fitting the model with constraints on the
$\beta_{ij}$ corresponding to no interaction existing. These are recorded in the following table.


     |  A1        A2        A3        A4
B1   |  0.26977   0.10403   0.21255   0.13393
B2   |  0.31663   0.15089   0.25942   0.18080
B3   |  0.46941   0.30367   0.41219   0.33357
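For a balanced design such as this one, the least-squares estimates under the additive model have a simple closed form: the fitted value in cell (i, j) is the mean for level i of A plus the mean for level j of B minus the grand mean. The following sketch applies this to the cell means reported for the full model. The closed form is a standard result for equal $n_{ij}$ that we assume here (the text obtains the fitted values from software), and the function name `additive_fit` is our own.

```python
# Least-squares estimates of the cell means under the full model (response 1/Y).
# Rows are B1, B2, B3; columns are A1, A2, A3, A4.
cell_means = [
    [0.24869, 0.11635, 0.18627, 0.16897],
    [0.32685, 0.13934, 0.27139, 0.17015],
    [0.48027, 0.30290, 0.42650, 0.30918],
]


def additive_fit(m):
    """Fitted cell means under no interaction for a balanced design:
    row mean + column mean - grand mean."""
    rows, cols = len(m), len(m[0])
    grand = sum(sum(r) for r in m) / (rows * cols)
    row_mean = [sum(r) / cols for r in m]
    col_mean = [sum(m[r][c] for r in range(rows)) / rows for c in range(cols)]
    return [[row_mean[r] + col_mean[c] - grand for c in range(cols)]
            for r in range(rows)]


fitted = additive_fit(cell_means)
for row in fitted:
    print([round(v, 5) for v in row])
```

Up to rounding in the last digit, the output agrees with the table of constrained estimates above.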


As we have decided that there is no interaction between A and B, we can assess 
single-factor effects by examining the response means for each factor separately. For 
example, the means for investigating the effect of A are given in the following table. 


A1      A2      A3      A4
0.352   0.186   0.295   0.216


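We can compare these means using procedures based on the t-distribution. Each of
these means is based on 12 observations, so the difference of two of them has
standard error $s\sqrt{1/12 + 1/12}$. For example, a 0.95-confidence interval for the
difference in the means at levels A1 and A2 is given by

$$\bar{y}_{1\cdot} - \bar{y}_{2\cdot} \pm s \sqrt{\frac{1}{12} + \frac{1}{12}} \; t_{0.975}(36) = (0.352 - 0.186) \pm \sqrt{\frac{0.00240}{6}} \, (2.0281) = (0.12544, 0.20656). \qquad (10.4.7)$$

This indicates that we would reject the null hypothesis of no difference between these
means at the 0.05 level.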

Notice that we have used the estimate of $\sigma^2$ based on the full model in (10.4.7).
Logically, it would seem to make more sense to use the estimate based on fitting the
additive model because we have decided that it is appropriate. When we do so, this is
referred to as pooling, as it can be shown that the new error estimate is calculated by
adding RSS(A × B) to the original ESS and dividing by the sum of the A × B degrees
of freedom and the error degrees of freedom. Not to pool is regarded as a somewhat
more conservative procedure.
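
The pooling rule just described can be written out as a formula. The following is a sketch only: the symbol $s_{\text{pooled}}^2$ is our own label (it is not used elsewhere in this section), and we assume the A × B degrees of freedom are $(a-1)(b-1)$ and the error degrees of freedom are $N - ab$.

```latex
s_{\text{pooled}}^2
  = \frac{\mathrm{RSS}(A \times B) + \mathrm{ESS}}{(a-1)(b-1) + (N - ab)}
```

In the example above, a = 4, b = 3, and the error degrees of freedom are 36, so under these assumptions the divisor would be 6 + 36 = 42.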



10.4.4 Randomized Blocks


With two-factor models, we generally want to investigate whether or not both of these 
factors have a relationship with the response Y. Suppose, however, that we know that a 
factor B has a relationship with Y, and we are interested in investigating whether or not 
another factor A has a relationship with Y. Should we run a single-factor experiment 
using the predictor A, or run a two-factor experiment including the factor B? 

The answer is as we have stated at the start of Section 10.4.2. Including the factor 
B will allow us, if B accounts for a lot of the observed variation, to make more accurate 
comparisons. Notice, however, that if B does not have a substantial effect on Y, then 
its inclusion will be a waste, as we sacrificed $a(b - 1)$ degrees of freedom that would
otherwise go toward the estimation of $\sigma^2$.

So it is important that we do indeed know that B has a substantial effect. In such 
a case, we refer to B as a blocking variable. It is important again that the blocking 
variable B be crossed with A. Then we can test for any effect due to A by first testing 
for an interaction between A and B; if no such interaction is found, then we test for an 
effect due to A alone, just as we have discussed in Section 10.4.3. 

A special case of using a blocking variable arises when we have $n_{ij} = 1$ for all i and
j. In this case, N = ab, so there are no degrees of freedom available for the estimation
of error. In fact, we have that (see Problem 10.4.19) $s^2 = 0$. Still, such a design has
practical value, provided we are willing to assume that there is no interaction between
A and B. This is called a randomized block design.

For a randomized block design, we have that

$$s^2 = \frac{\mathrm{RSS}(A \times B)}{(a - 1)(b - 1)} \qquad (10.4.8)$$

is an unbiased estimate of $\sigma^2$, and so we have $(a - 1)(b - 1)$ degrees of freedom for
the estimation of error. Of course, this will not be correct if A and B do interact,
but when they do not, this can be a highly efficient design, as we have removed the
effect of the variation due to B and require only ab observations for this. When the
randomized block design is appropriate, we test for an effect due to A, using
$F \sim F(a - 1, (a - 1)(b - 1))$ under $H_0$, via the P-value

$$P\left(F > \frac{\mathrm{RSS}(A)/(a - 1)}{s^2}\right).$$
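
The computations for a randomized block design can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the data are hypothetical, and we assume the usual sums of squares for one observation per cell, with RSS(A) and RSS(B) built from the row and column means and RSS(A × B) from what remains.

```python
# Randomized block design: one observation per (A, B) cell.
# The numbers below are hypothetical, chosen only to illustrate the arithmetic.
y = [
    [7.6, 8.0, 7.9],   # A = 1, blocks B = 1, 2, 3
    [8.1, 8.2, 7.9],   # A = 2
    [7.2, 7.6, 7.8],   # A = 3
    [7.5, 7.7, 7.2],   # A = 4
]
a, b = len(y), len(y[0])

grand = sum(sum(row) for row in y) / (a * b)
row_means = [sum(row) / b for row in y]                              # means for levels of A
col_means = [sum(y[i][j] for i in range(a)) / a for j in range(b)]   # means for blocks

# Sums of squares for A, for the blocking variable B, and for A x B.
rss_a = b * sum((m - grand) ** 2 for m in row_means)
rss_b = a * sum((m - grand) ** 2 for m in col_means)
rss_ab = sum(
    (y[i][j] - row_means[i] - col_means[j] + grand) ** 2
    for i in range(a) for j in range(b)
)
total_ss = sum((y[i][j] - grand) ** 2 for i in range(a) for j in range(b))

s2 = rss_ab / ((a - 1) * (b - 1))       # the estimate in (10.4.8)
f_stat = (rss_a / (a - 1)) / s2         # refer to F(a - 1, (a - 1)(b - 1))
print(f"s^2 = {s2:.5f}, F = {f_stat:.3f} on ({a - 1}, {(a - 1) * (b - 1)}) df")
```

The P-value is then the upper tail probability of the F(a − 1, (a − 1)(b − 1)) distribution at `f_stat`, which requires statistical software or tables.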


10.4.5 One Categorical and One Quantitative Predictor


It is also possible that the response is quantitative while some of the predictors are 
categorical and some are quantitative. We now consider the situation where we have 
one categorical predictor A, taking a values, and one quantitative predictor W. We 
assume that the regression model applies. Furthermore, we restrict our attention to the 
situation where we suppose that, within each level of A, the mean response varies as 


$$E(Y \mid (A, W) = (i, w)) = \beta_{i1} + \beta_{i2} w,$$

so that we have a simple linear regression model within each level of A.
If we introduce the dummy variables

$$x_{ij} = \begin{cases} W^{j-1} & A = i \\ 0 & A \neq i \end{cases}$$

for $i = 1, \ldots, a$ and $j = 1, 2$, then we can write the linear regression model as

$$E(Y \mid (X_{ij}) = (x_{ij})) = (\beta_{11}x_{11} + \beta_{12}x_{12}) + \cdots + (\beta_{a1}x_{a1} + \beta_{a2}x_{a2}).$$

Here, $\beta_{i1}$ is the intercept and $\beta_{i2}$ is the slope specifying the relationship between Y and
W when A = i. The methods of Section 10.3.4 are then available for inference about
this model.

We also have a notion of interaction in this context, as we say that the two predictors
interact if the slopes of the lines vary across the levels of A. So saying that
no interaction exists is the same as saying that the response curves are parallel when
graphed for each level of A. If an interaction exists, then it is definite that both A and
W have an effect on Y. Thus the null hypothesis that no interaction exists is equivalent
to $H_0 : \beta_{12} = \cdots = \beta_{a2}$.

If we decide that no interaction exists, then we can test for no effect due to W by
testing the null hypothesis that the common slope is equal to 0, or we can test the null
hypothesis that there is no effect due to A by testing $H_0 : \beta_{11} = \cdots = \beta_{a1}$, i.e., that
the intercept terms are the same across the levels of A.

We do not pursue the analysis of this model further here. Statistical software is 
available, however, that will calculate the relevant ANOVA table for assessing the var- 
ious null hypotheses. 
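
To make the dummy-variable construction concrete, the following Python sketch builds the variables $x_{ij}$ and computes the least-squares intercept and slope within each level of A. The data and the helper names `dummy_row` and `fit_line` are hypothetical, introduced only for illustration. Because the dummy variables for different levels of A are never nonzero together, the least-squares fit of the full model separates into one simple linear regression per level, which is what the sketch exploits.

```python
# Hypothetical data: (level of A, value of W, response y), with a = 2 levels.
data = [
    (1, 0.0, 1.1), (1, 1.0, 2.9), (1, 2.0, 5.2), (1, 3.0, 6.8),
    (2, 0.0, 4.0), (2, 1.0, 4.6), (2, 2.0, 4.9), (2, 3.0, 5.7),
]
a = 2

def dummy_row(level, w):
    """Return (x_11, x_12, ..., x_a1, x_a2), where x_ij = w**(j-1) if level == i, else 0."""
    row = []
    for i in range(1, a + 1):
        for j in (1, 2):
            row.append(w ** (j - 1) if level == i else 0.0)
    return row

design = [dummy_row(level, w) for level, w, _ in data]

def fit_line(ws, ys):
    """Least-squares intercept and slope for a simple linear regression of ys on ws."""
    n = len(ws)
    w_bar, y_bar = sum(ws) / n, sum(ys) / n
    slope = (sum((w - w_bar) * (y - y_bar) for w, y in zip(ws, ys))
             / sum((w - w_bar) ** 2 for w in ws))
    return y_bar - slope * w_bar, slope

# One straight-line fit per level of A gives the estimates of (beta_i1, beta_i2).
coef = {}
for i in range(1, a + 1):
    ws = [w for level, w, _ in data if level == i]
    ys = [y for level, _, y in data if level == i]
    coef[i] = fit_line(ws, ys)
    print(f"A = {i}: intercept {coef[i][0]:.3f}, slope {coef[i][1]:.3f}")
```

Clearly different fitted slopes across the levels of A, as in these made-up data, are what an interaction between A and W looks like; a formal assessment still requires the test of $H_0 : \beta_{12} = \cdots = \beta_{a2}$.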


Analysis of Covariance 


Suppose we are running an experimental design and for each experimental unit we can 
measure, but not control, a quantitative variable W that we believe has an effect on the 
response Y. If the effect of this variable is appreciable, then good statistical practice 
suggests we should include this variable in the model, as we will reduce the contri- 
bution of error to our experimental results and thus make more accurate comparisons. 
Of course, we pay a price when we do this, as we lose degrees of freedom that would 
otherwise be available for the estimation of error. So we must be sure that W does have 
a significant effect in such a case. Also, we do not test for an effect of such a variable, 
as we presumably know it has an effect. This technique is referred to as the analysis of 
covariance and is obviously similar in nature to the use of blocking variables. 


Summary of Section 10.4 


• We considered the situation involving a quantitative response and categorical
predictor variables.

• By the introduction of dummy variables for the predictor variables, we can consider
this situation as a particular application of the multiple regression model of
Section 10.3.4.

• If we decide that a relationship exists, then we typically try to explain what
form this relationship takes by comparing means. To prevent finding too many
statistically significant differences, we lower the individual error rate to ensure a
sensible family error rate.

• When we have two predictors, we first check to see if the factors interact. If the
two predictors interact, then both have an effect on the response.

• A special case of a two-way analysis arises when one of the predictors serves as
a blocking variable. It is generally important to know that the blocking variable
has an effect on the response, so that we do not waste degrees of freedom by
including it.

• Sometimes we can measure variables on individual experimental units that we
know have an effect on the response. In such a case, we include these variables
in our model, as they will reduce the contribution of random error to the analysis
and make our inferences more accurate.


EXERCISES 


10.4.1 The following values of a response Y were obtained for three settings of a 
categorical predictor A. 


A = 1 | 2.9976  0.3606  4.7716   1.5652
A = 2 | 0.7468  1.3308  2.2167  -0.3184
A = 3 | 2.1192  2.3739  0.3335   3.3015


Suppose we assume the normal regression model for these data with one categorical 
predictor. 

(a) Produce a side-by-side boxplot for the data. 

(b) Plot the standardized residuals against A (if you are using a computer for your cal- 
culations, also produce a normal probability plot of the standardized residuals). Does 
this give you grounds for concern that the model assumptions are incorrect? 

(c) Carry out a one-way ANOVA to test for any difference among the conditional means 
of Y given A. 

(d) If warranted, construct 0.95-confidence intervals for the differences between the 
means and summarize your findings. 

10.4.2 The following values of a response Y were obtained for three settings of a 
categorical predictor A. 


A = 1 | 0.090   0.800  33.070  -1.890
A = 2 | 5.120   1.580   1.760   1.740
A = 3 | 5.080  -3.510   4.420   1.190


Suppose we assume the normal regression model for these data with one categorical 
predictor. 


(a) Produce a side-by-side boxplot for the data. 



(b) Plot the standardized residuals against A (if you are using a computer for your cal- 
culations, also produce a normal probability plot of the standardized residuals). Does 
this give you grounds for concern that the model assumptions are incorrect? 

(c) If concerns arise about the validity of the model, can you “fix” the problem? 

(d) If you have been able to fix any problems encountered with the model, carry out a 
one-way ANOVA to test for any differences among the conditional means of Y given 
A. 

(e) If warranted, construct 0.95-confidence intervals for the differences between the 
means and summarize your findings. 

10.4.3 The following table gives the percentage moisture content of two different types 
of cheeses determined by randomly sampling batches of cheese from the production 
process. 


Cheese 1 | 39.02, 38.79, 35.74, 35.41, 37.02, 36.00 
Cheese 2 | 38.96, 39.01, 35.58, 35.52, 35.70, 36.04 


Suppose we assume the normal regression model for these data with one categorical 
predictor. 

(a) Produce a side-by-side boxplot for the data. 

(b) Plot the standardized residuals against Cheese (if you are using a computer for 
your calculations, also produce a normal probability plot of the standardized residuals). 
Does this give you grounds for concern that the model assumptions are incorrect? 

(c) Carry out a one-way ANOVA to test for any differences among the conditional 
means of Y given Cheese. Note that this is the same as a t-test for the difference in the 
means. 

10.4.4 In an experiment, rats were fed a stock ration for 100 days with various amounts 
of gossypol added. The following weight gains in grams were recorded. 


0.00% Gossypol | 228, 229, 218, 216, 224, 208, 235, 229,
               | 233, 219, 224, 220, 232, 200, 208, 232
0.04% Gossypol | 186, [...], 208, 228, 198, 222, 273, [...]
0.07% Gossypol | 179, 193, 183, 180, 143, 204, 114, 188,
               | 178, 134, 208, 196
0.10% Gossypol | 130, 87, 135, 116, 118, 165, 151, 59,
               | 156, 64, 78, 94, 150, 160, 122, 110, 178
0.13% Gossypol | 154, 130, 118, 118, 118, 104, 112, 134,
               | 98, 100, 104


Suppose we assume the normal regression model for these data and treat gossypol as a 
categorical predictor taking five levels. 


(a) Create a side-by-side boxplot graph for the data. Does this give you any reason 
to be concerned about the assumptions that underlie an analysis based on the normal 
regression model? 

(b) Produce a plot of the standardized residuals against the factor gossypol (if you are 
using a computer for your calculations, also produce a normal probability plot of the 
standardized residuals). What are your conclusions? 



(c) Carry out a one-way ANOVA to test for any differences among the mean responses 
for the different amounts of gossypol. 

(d) Compute 0.95-confidence intervals for all the pairwise differences of means and 
summarize your conclusions. 

10.4.5 In an investigation into the effect of deficiencies of trace elements on a variable 
Y measured on sheep, the data in the following table were obtained. 


Control         | 13.2, 13.6, 11.9, 13.0, 14.5, 13.4
Cobalt          | 11.9, 12.2, 13.9, 12.8, 12.7, 12.9
Copper          | 14.2, 14.0, 15.1, 14.9, 13.7, 15.8
Cobalt + Copper | 15.0, 15.6, 14.5, 15.8, 13.9, 14.4


Suppose we assume the normal regression model for these data with one categorical 
predictor. 

(a) Produce a side-by-side boxplot for the data. 

(b) Plot the standardized residuals against the predictor (if you are using a computer for 
your calculations, also produce a normal probability plot of the standardized residuals). 
Does this give you grounds for concern that the model assumptions are incorrect? 

(c) Carry out a one-way ANOVA to test for any differences among the conditional 
means of Y given the predictor. 

(d) If warranted, construct 0.95-confidence intervals for all the pairwise differences 
between the means and summarize your findings. 


10.4.6 Two diets were given to samples of pigs over a period of time, and the following 
weight gains (in lbs) were recorded. 


Diet A | 8, 4, 14, 15, 11, 10, 6, 12, 13, 7
Diet B | 7, 13, 22, 15, 12, 14, 18, 8, 21, 23, 10, 17


Suppose we assume the normal regression model for these data. 

(a) Produce a side-by-side boxplot for the data. 

(b) Plot the standardized residuals against Diet. Also produce a normal probability plot 
of the standardized residuals. Does this give you grounds for concern that the model 
assumptions are incorrect? 

(c) Carry out a one-way ANOVA to test for a difference between the conditional means 
of Y given Diet. 

(d) Construct 0.95-confidence intervals for differences between the means. 

10.4.7 Ten students were randomly selected from the students in a university who took 
first-year calculus and first-year statistics. Their grades in these courses are recorded 
in the following table. 


Student    |  1   2   3   4   5   6   7   8   9  10
Calculus   | 66  61  77  62  66  68  64  75  59  71
Statistics | 66  63  79  63  67  70  71  80  63  74


Suppose we assume the normal regression model for these data. 



(a) Produce a side-by-side boxplot for the data. 

(b) Treating the calculus and statistics marks as separate samples, carry out a one-way 
ANOVA to test for any difference between the mean mark in calculus and the mean 
mark in statistics. Produce the appropriate plots to check for model assumptions. 

(c) Now take into account that each student has a calculus mark and a statistics mark
and test for any difference between the mean mark in calculus and the mean mark in
statistics. Produce the appropriate plots to check for model assumptions. Compare
your results with those obtained in part (b).

(d) Estimate the correlation between the calculus and statistics marks. 

10.4.8 The following data were recorded in Statistical Methods, 6th ed., by G. Snedecor 
and W. Cochran (Iowa State University Press, Ames, 1967) and represent the average 
number of florets observed on plants in seven plots. Each of the plants was planted 
with either high corms or low corms (a type of underground stem). 


          | Plot 1  Plot 2  Plot 3  Plot 4  Plot 5  Plot 6  Plot 7
Corm High |  11.2    13.3    12.8    13.7    12.2    11.9    12.1
Corm Low  |  14.6    12.6    15.0    15.6    12.7    12.0    13.1


Suppose we assume the normal regression model for these data. 


(a) Produce a side-by-side boxplot for the data. 

(b) Treating the Corm High and Corm Low measurements as separate samples, carry 
out a one-way ANOVA to test for any difference between the population means. Pro- 
duce the appropriate plots to check for model assumptions. 

(c) Now take into account that each plot has a Corm High and Corm Low measurement. 
Compare your results with those obtained in part (b). Produce the appropriate plots to 
check for model assumptions. 

(d) Estimate the correlation between the Corm High and Corm Low measurements.

10.4.9 Suppose two measurements, $Y_1$ and $Y_2$, corresponding to different treatments,
are taken on the same individual who has been randomly sampled from a population
$\Pi$. Suppose that $Y_1$ and $Y_2$ have the same variance and are negatively correlated. Our
goal is to compare the treatment means. Explain why it would have been better to
have randomly sampled two individuals from $\Pi$ and applied the treatments to these
individuals separately. (Hint: Consider $\mathrm{Var}(Y_1 - Y_2)$ in these two sampling situations.)

10.4.10 List the assumptions that underlie the validity of the one-way ANOVA test 
discussed in Section 10.4.1. 

10.4.11 List the assumptions that underlie the validity of the paired comparison test 
discussed in Section 10.4.2. 

10.4.12 List the assumptions that underlie the validity of the two-way ANOVA test 
discussed in Section 10.4.3. 

10.4.13 List the assumptions that underlie the validity of the test used with the randomized
block design, discussed in Section 10.4.4, when $n_{ij} = 1$ for all i and j.



PROBLEMS 


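10.4.14 Prove that $\sum_{i=1}^{a} \sum_{j=1}^{n_i} (y_{ij} - \beta_i)^2$ is minimized as a function of the $\beta_i$ by
$\beta_i = \bar{y}_i = (y_{i1} + \cdots + y_{in_i})/n_i$ for $i = 1, \ldots, a$.

10.4.15 Prove that

$$\sum_{i=1}^{a} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 = \sum_{i=1}^{a} n_i (\bar{y}_i - \bar{y})^2 + \sum_{i=1}^{a} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2,$$

where $\bar{y}_i = (y_{i1} + \cdots + y_{in_i})/n_i$ and $\bar{y}$ is the grand mean.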

10.4.16 Argue that if the relationship between a quantitative response Y and two cat- 
egorical predictors A and B is given by a linear regression model, then A and B both 
have an effect on Y whenever A and B interact. (Hint: What does it mean in terms of 
response curves for an interaction to exist, for an effect due to A to exist?) 

10.4.17 Establish that (10.4.2) is the appropriate expression for the standardized resid- 
ual for the linear regression model with one categorical predictor. 

10.4.18 Establish that (10.4.6) is the appropriate expression for the standardized resid- 
ual for the linear regression model with two categorical predictors. 

10.4.19 Establish that $s^2 = 0$ for the linear regression model with two categorical
predictors when $n_{ij} = 1$ for all i and j.

10.4.20 How would you assess whether or not the randomized block design was ap- 
propriate after collecting the data? 


COMPUTER PROBLEMS 


10.4.21 Use appropriate software to carry out Fisher’s multiple comparison test on the 
data in Exercise 10.4.5 so that the family error rate is between 0.04 and 0.05. What 
individual error rate is required? 

10.4.22 Consider the data in Exercise 10.4.3, but now suppose we also take into ac- 
count that the cheeses were made in lots where each lot corresponded to a production 
run. Recording the data this way, we obtain the following table. 


         | Lot 1         Lot 2         Lot 3
Cheese 1 | 39.02, 38.79  35.74, 35.41  37.02, 36.00
Cheese 2 | 38.96, 39.01  35.58, 35.52  35.70, 36.04

Suppose we assume the normal regression model for these data with two categorical 
predictors. 


(a) Produce a side-by-side boxplot for the data for each treatment. 

(b) Produce a table of cell means. 

(c) Produce a normal probability plot of the standardized residuals and a plot of the 
standardized residuals against each treatment combination (code the treatment combi- 
nations so there is a unique integer corresponding to each). Comment on the validity 
of the model. 



(d) Construct the ANOVA table testing first for no interaction between A and B and, if 
necessary, an effect due to A and an effect due to B. 

(e) Based on the results of part (d), construct the appropriate table of means, plot the
corresponding response curve, and make all pairwise comparisons among the means.

(f) Compare your results with those obtained in Exercise 10.4.3 and comment on the
differences.

10.4.23 A two-factor experimental design was carried out, with factors A and B both 
categorical variables taking three values. Each treatment was applied four times and 
the following response values were obtained. 


      | A = 1         | A = 2         | A = 3
B = 1 | 19.86  20.88  | 26.37  24.38  | 29.72  29.64
      | 20.15  25.44  | 24.87  30.93  | 30.06  35.49
B = 2 | 15.35  15.86  | 22.82  20.98  | 27.12  24.27
      | 21.86  26.92  | 29.38  34.13  | 34.78  40.72
B = 3 |  4.01   4.48  | 10.34   9.38  | 15.64  14.03
      | 21.66  25.93  | 30.59  40.04  | 36.80  42.55

Suppose we assume the normal regression model for these data with two categorical 
predictors. 


(a) Produce a side-by-side boxplot for the data for each treatment. 

(b) Produce a table of cell means. 

(c) Produce a normal probability plot of the standardized residuals and a plot of the 
standardized residuals against each treatment combination (code the treatment combi- 
nations so there is a unique integer corresponding to each). Comment on the validity 
of the model. 

(d) Construct the ANOVA table testing first for no interaction between A and B and, if 
necessary, an effect due to A and an effect due to B. 

(e) Based on the results of part (d), construct the appropriate table of means, plot the 
corresponding response curves, and make all pairwise comparisons among the means. 
10.4.24 A chemical paste is made in batches and put into casks. Ten delivery batches 
were randomly selected for testing; then three casks were randomly selected from each 
delivery and the paste strength was measured twice, based on samples drawn from each 
sampled cask. The response was expressed as a percentage of fill strength. The col- 
lected data are given in the following table. Suppose we assume the normal regression 
model for these data with two categorical predictors. 


       | Batch 1     Batch 2     Batch 3     Batch 4     Batch 5
Cask 1 | 62.8, 62.6  60.0, 61.4  58.7, 57.5  57.1, 56.4  55.1, 55.1
Cask 2 | 60.1, 62.3  57.5, 56.9  63.9, 63.1  56.9, 58.6  54.7, 54.2
Cask 3 | 62.7, 63.1  61.1, 58.9  65.4, 63.7  64.7, 64.5  58.5, 57.5

       | Batch 6     Batch 7     Batch 8     Batch 9     Batch 10
Cask 1 | 63.4, 64.9  62.5, 62.6  59.2, 59.4  54.8, 54.8  58.3, 59.3
Cask 2 | 59.3, 58.1  61.0, 58.7  65.2, 66.0  64.0, 64.0  59.2, 59.2
Cask 3 | 60.5, 60.0  56.9, 57.7  64.8, 64.1  57.7, 56.8  58.9, 56.8



(a) Produce a side-by-side boxplot for the data for each treatment. 

(b) Produce a table of cell means. 

(c) Produce a normal probability plot of the standardized residuals and a plot of the 
standardized residuals against each treatment combination (code the treatment combi- 
nations so there is a unique integer corresponding to each). Comment on the validity 
of the model. 

(d) Construct the ANOVA table testing first for no interaction between Batch and Cask 
and, if necessary, no effect due to Batch and no effect due to Cask. 

(e) Based on the results of part (d), construct the appropriate table of means and plot 
the corresponding response curves. 

10.4.25 The following data arose from a randomized block design, where factor B is 
the blocking variable and corresponds to plots of land on which cotton is planted. Each 
plot was divided into five subplots, and different concentrations of fertilizer were ap- 
plied to each, with the response being a strength measurement of the cotton harvested. 
There were three blocks and five different concentrations of fertilizer. Note that there is 
only one observation for each block and concentration combination. Further discussion 
of these data can be found in Experimental Design, 2nd ed., by W. G. Cochran and 
G. M. Cox (John Wiley & Sons, New York, 1957, pp. 107-108). Suppose we assume 
the normal regression model with two categorical predictors. 


   | B = 1   B = 2   B = 3


(a) Construct the ANOVA table for testing for no effect due to fertilizer and which also 
removes the variation due to the blocking variable. 

(b) Beyond the usual assumptions that we are concerned about, what additional as- 
sumption is necessary for this analysis? 

(c) Actually, the factor A is a quantitative variable. If we were to take this into ac- 
count by fitting a model that had the same slope for each block but possibly different 
intercepts, then what benefit would be gained? 

(d) Carry out the analysis suggested in part (c) and assess whether or not this model 
makes sense for these data. 


10.5 Categorical Response and Quantitative Predictors


We now consider the situation in which the response is categorical but at least some 
of the predictors are quantitative. The essential difficulty in this context lies with the 
quantitative predictors, so we will focus on the situation in which all the predictors
are quantitative. When there are also some categorical predictors, these can be han-
dled in the same way, as we can replace each categorical predictor by a set of dummy 
quantitative variables, as discussed in Section 10.4.5. 

For reasons of simplicity, we will restrict our attention to the situation in which
the response variable Y is binary valued, and we will take these values to be 0 and 1.
Suppose, then, that there are k quantitative predictors $X_1, \ldots, X_k$. Because $Y \in \{0, 1\}$,
we have

$$E(Y \mid X_1 = x_1, \ldots, X_k = x_k) = P(Y = 1 \mid X_1 = x_1, \ldots, X_k = x_k) \in [0, 1].$$

Therefore, we cannot write $E(Y \mid x_1, \ldots, x_k) = \beta_1 x_1 + \cdots + \beta_k x_k$ without placing
some unnatural restrictions on the $\beta_i$ to ensure that $\beta_1 x_1 + \cdots + \beta_k x_k \in [0, 1]$.
Perhaps the simplest way around this is to use a 1-1 function $l : [0, 1] \to R^1$ and
write

$$l(P(Y = 1 \mid X_1 = x_1, \ldots, X_k = x_k)) = \beta_1 x_1 + \cdots + \beta_k x_k,$$

so that

$$P(Y = 1 \mid X_1 = x_1, \ldots, X_k = x_k) = l^{-1}(\beta_1 x_1 + \cdots + \beta_k x_k).$$

We refer to $l$ as a link function. There are many possible choices for $l$. For example, it
is immediate that we can take $l$ to be any inverse cdf for a continuous distribution.

If we take $l = \Phi^{-1}$, i.e., the inverse cdf of the N(0, 1) distribution, then this is
called the probit link. A more commonly used link, due to some inherent mathematical
simplicities, is the logistic link given by

$$l(p) = \ln\left(\frac{p}{1 - p}\right). \qquad (10.5.1)$$

The right-hand side of (10.5.1) is referred to as the logit or log odds. The logistic link
is the inverse cdf of the logistic distribution (see Exercise 10.5.1). We will restrict our
discussion to the logistic link hereafter.
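
The two link functions mentioned above, together with their inverses, can be written in a few lines of Python. This is only a sketch: the function names are our own, and the probit link relies on `statistics.NormalDist` from the Python standard library (version 3.8 or later).

```python
import math
from statistics import NormalDist

def logit(p):
    """Logistic link (10.5.1): the log odds ln(p / (1 - p)), for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(t):
    """Inverse of the logistic link, i.e., the cdf of the logistic distribution."""
    return 1.0 / (1.0 + math.exp(-t))

_std_normal = NormalDist()  # the N(0, 1) distribution

def probit(p):
    """Probit link: the inverse cdf of the N(0, 1) distribution, for 0 < p < 1."""
    return _std_normal.inv_cdf(p)

def inv_probit(t):
    """Inverse of the probit link: the N(0, 1) cdf."""
    return _std_normal.cdf(t)

# Both links map probabilities in (0, 1) onto the whole real line.
for p in (0.1, 0.5, 0.9):
    print(p, round(logit(p), 4), round(probit(p), 4))
```

Either inverse link turns an unrestricted value of $\beta_1 x_1 + \cdots + \beta_k x_k$ into a number between 0 and 1, which is the purpose of introducing a link function.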

The logistic link implies that (see Exercise 10.5.2)

$$P(Y = 1 \mid X_1 = x_1, \ldots, X_k = x_k) = \frac{\exp\{\beta_1 x_1 + \cdots + \beta_k x_k\}}{1 + \exp\{\beta_1 x_1 + \cdots + \beta_k x_k\}}, \qquad (10.5.2)$$

which is a relatively simple relationship. We see immediately, however, that

$$\mathrm{Var}(Y \mid X_1 = x_1, \ldots, X_k = x_k) = P(Y = 1 \mid X_1 = x_1, \ldots, X_k = x_k)\,(1 - P(Y = 1 \mid X_1 = x_1, \ldots, X_k = x_k)),$$

so the variance of the conditional distribution of Y, given the predictors, depends on the
values of the predictors. Therefore, these models are not, strictly speaking, regression
models as we have defined them. Still, when we use the link function given by (10.5.1),
we refer to this as the logistic regression model.

Now suppose we observe n independent observations $(x_{i1}, \ldots, x_{ik}, y_i)$ for $i = 1, \ldots, n$.
We then have that, given $(x_{i1}, \ldots, x_{ik})$, the response $y_i$ is an observation
from the Bernoulli$(P(Y = 1 \mid X_1 = x_{i1}, \ldots, X_k = x_{ik}))$ distribution. Then (10.5.2)
implies that the conditional likelihood, given the values of the predictors, is

$$\prod_{i=1}^{n} \left(\frac{\exp\{\beta_1 x_{i1} + \cdots + \beta_k x_{ik}\}}{1 + \exp\{\beta_1 x_{i1} + \cdots + \beta_k x_{ik}\}}\right)^{y_i} \left(\frac{1}{1 + \exp\{\beta_1 x_{i1} + \cdots + \beta_k x_{ik}\}}\right)^{1 - y_i}.$$

Inference about the $\beta_i$ then proceeds via the likelihood methods discussed in Chapter 6.
In fact, we need to use software to obtain the MLE's, and, because the exact
sampling distributions of these quantities are not available, the large sample methods
discussed in Section 6.5 are used for approximate confidence intervals and P-values.
Note that assessing the null hypothesis $H_0 : \beta_i = 0$ is equivalent to assessing the null
hypothesis that the predictor $X_i$ does not have a relationship with the response.
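
As a small illustration of the conditional likelihood, the following Python sketch evaluates its logarithm for given coefficient values. The data are hypothetical and the function name `log_likelihood` is our own; maximizing this function over the coefficients is what statistical software does to obtain the MLE.

```python
import math

def log_likelihood(beta, xs, ys):
    """Conditional log-likelihood of the logistic regression model.

    beta: coefficients (beta_1, ..., beta_k)
    xs:   list of predictor rows (x_i1, ..., x_ik)
    ys:   list of binary responses y_i in {0, 1}
    """
    total = 0.0
    for x, y in zip(xs, ys):
        eta = sum(b * v for b, v in zip(beta, x))   # beta_1 x_i1 + ... + beta_k x_ik
        # log of [exp(eta) / (1 + exp(eta))]^y * [1 / (1 + exp(eta))]^(1 - y)
        total += y * eta - math.log1p(math.exp(eta))
    return total

# Hypothetical data: a constant predictor (intercept) and one quantitative predictor.
xs = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
ys = [0, 0, 1, 0, 1]

print(log_likelihood((0.0, 0.0), xs, ys))   # every P(Y = 1) equals 1/2 here
print(log_likelihood((0.0, 1.0), xs, ys))   # a positive slope fits these data better
```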
We illustrate the use of logistic regression via an example. 


EXAMPLE 10.5.1
The following table of data represents the (number of failures, number of successes)
for ingots prepared for rolling under different settings of the predictor variables, U =
soaking time and V = heating time, as reported in Analysis of Binary Data, by D. R.
Cox (Methuen, London, 1970). A failure indicates that an ingot is not ready for rolling
after the treatment. There were observations at 19 different settings of these variables.

        | V = 7   V = 14  V = 27  V = 51
U = 1.0 | (0,10)  (0,31)  (1,55)  (3,10)
U = 1.7 | (0,17)  (0,43)  (4,40)  (0,1)
U = 2.2 | (0,7)   (2,31)  (0,21)  (0,1)
U = 2.8 | (0,12)  (0,31)  (1,21)  (0,0)
U = 4.0 | (0,9)   (0,19)  (1,15)  (0,1)

Including an intercept in the model and linear terms for U and V leads to three
predictor variables $X_1 = 1$, $X_2 = U$, $X_3 = V$, and the model takes the form

$$P(Y = 1 \mid X_2 = x_2, X_3 = x_3) = \frac{\exp\{\beta_1 + \beta_2 x_2 + \beta_3 x_3\}}{1 + \exp\{\beta_1 + \beta_2 x_2 + \beta_3 x_3\}}.$$

Fitting the model via the method of maximum likelihood leads to the estimates given
in the following table. Here, z is the value of the estimate divided by its standard error.
Because this is approximately distributed N(0, 1) when the corresponding $\beta_i$ equals
0, the P-value for assessing the null hypothesis that $\beta_i = 0$ is $P(|Z| > |z|)$ with
$Z \sim N(0, 1)$.

Coefficient | Estimate   Std. Error    z      P-value
$\beta_1$   |  5.55900   1.12000     4.96    0.000
$\beta_2$   | -0.05680   0.33120    -0.17    0.864
$\beta_3$   | -0.08203   0.02373    -3.46    0.001
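
For readers who want to see how such estimates arise, here is a minimal Python sketch that fits the model to the ingot data by Newton-Raphson iteration on the log-likelihood. It is one possible approach, not necessarily the algorithm used by any particular statistical package; the helper names `solve` and `score_and_info` are our own, and the cell with no observations (U = 2.8, V = 51) is omitted.

```python
import math

# (U, V, failures, successes) for the 19 cells that contain observations.
cells = [
    (1.0, 7, 0, 10), (1.0, 14, 0, 31), (1.0, 27, 1, 55), (1.0, 51, 3, 10),
    (1.7, 7, 0, 17), (1.7, 14, 0, 43), (1.7, 27, 4, 40), (1.7, 51, 0, 1),
    (2.2, 7, 0, 7), (2.2, 14, 2, 31), (2.2, 27, 0, 21), (2.2, 51, 0, 1),
    (2.8, 7, 0, 12), (2.8, 14, 0, 31), (2.8, 27, 1, 21),
    (4.0, 7, 0, 9), (4.0, 14, 0, 19), (4.0, 27, 1, 15), (4.0, 51, 0, 1),
]
X = [(1.0, u, float(v)) for u, v, _, _ in cells]      # x1 = 1, x2 = U, x3 = V
trials = [f + s for _, _, f, s in cells]
successes = [s for _, _, _, s in cells]

def solve(A, b):
    """Solve the linear system A z = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def score_and_info(beta):
    """Gradient of the log-likelihood and the information matrix at beta."""
    p = [1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, row)))) for row in X]
    idx = range(len(X))
    score = [sum((successes[i] - trials[i] * p[i]) * X[i][j] for i in idx) for j in range(3)]
    info = [[sum(trials[i] * p[i] * (1 - p[i]) * X[i][j] * X[i][k] for i in idx)
             for k in range(3)] for j in range(3)]
    return score, info

beta = [0.0, 0.0, 0.0]
for _ in range(50):                                    # Newton-Raphson iterations
    score, info = score_and_info(beta)
    step = solve(info, score)
    beta = [b + d for b, d in zip(beta, step)]
    if max(abs(d) for d in step) < 1e-10:
        break

# Large-sample standard errors from the inverse of the information matrix.
_, info = score_and_info(beta)
se = [math.sqrt(solve(info, [1.0 if k == j else 0.0 for k in range(3)])[j]) for j in range(3)]
for name, b, e in zip(("beta_1", "beta_2", "beta_3"), beta, se):
    z = b / e
    p_value = math.erfc(abs(z) / math.sqrt(2.0))       # P(|Z| > |z|) for Z ~ N(0, 1)
    print(f"{name}: estimate {b:.5f}, std. error {e:.5f}, z {z:.2f}, P-value {p_value:.3f}")
```

Grouping the Bernoulli observations into binomial counts per cell changes the likelihood only by a constant factor, so the maximizing values are the same.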


Of course, we have to feel confident that the model is appropriate before we can
proceed to make formal inferences about the $\beta_i$. In this case, we note that the number
of successes $s(x_2, x_3)$ in the cell of the table, corresponding to the setting $(X_2, X_3) = (x_2, x_3)$,
is an observation from a

$$\text{Binomial}(m(x_2, x_3), P(Y = 1 \mid X_2 = x_2, X_3 = x_3))$$

distribution, where $m(x_2, x_3)$ is the sum of the number of successes and failures in that
cell. So, for example, if $X_2 = U = 1.0$ and $X_3 = V = 7$, then $m(1.0, 7) = 10$ and
$s(1.0, 7) = 10$. Denoting the estimate of $P(Y = 1 \mid X_2 = x_2, X_3 = x_3)$ by $\hat{P}(x_2, x_3)$,
obtained by plugging in the MLE, we have that (see Problem 10.5.8)

$$X^2 = \sum_{(x_2, x_3)} \frac{(s(x_2, x_3) - m(x_2, x_3)\hat{P}(x_2, x_3))^2}{m(x_2, x_3)\hat{P}(x_2, x_3)(1 - \hat{P}(x_2, x_3))} \qquad (10.5.3)$$

is asymptotically distributed as a $\chi^2(19 - 3) = \chi^2(16)$ distribution when the model is
correct. We determine the degrees of freedom by counting the number of cells where
there were observations (19 in this case, as no observations were obtained when U =
2.8, V = 51) and subtracting the number of parameters estimated. For these data,
$X^2 = 13.543$ and the P-value is $P(\chi^2(16) > 13.543) = 0.633$. Therefore, we have no
evidence that the model is incorrect and can proceed to make inferences about the $\beta_i$
based on the logistic regression model.
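
The goodness-of-fit statistic can be checked numerically. The sketch below plugs the reported MLE's into the fitted probabilities and sums, over the 19 cells with observations, the squared difference between observed and fitted successes divided by $m\hat{P}(1 - \hat{P})$. The variable names are our own, and the P-value itself is left to software or tables because it requires the chi-squared distribution function.

```python
import math

# (U, V, failures, successes) for the 19 cells of Example 10.5.1 with observations.
cells = [
    (1.0, 7, 0, 10), (1.0, 14, 0, 31), (1.0, 27, 1, 55), (1.0, 51, 3, 10),
    (1.7, 7, 0, 17), (1.7, 14, 0, 43), (1.7, 27, 4, 40), (1.7, 51, 0, 1),
    (2.2, 7, 0, 7), (2.2, 14, 2, 31), (2.2, 27, 0, 21), (2.2, 51, 0, 1),
    (2.8, 7, 0, 12), (2.8, 14, 0, 31), (2.8, 27, 1, 21),
    (4.0, 7, 0, 9), (4.0, 14, 0, 19), (4.0, 27, 1, 15), (4.0, 51, 0, 1),
]
b1, b2, b3 = 5.55900, -0.05680, -0.08203   # MLE's reported in the example

x2_stat = 0.0
for u, v, f, s in cells:
    m = f + s                                            # observations in the cell
    p_hat = 1.0 / (1.0 + math.exp(-(b1 + b2 * u + b3 * v)))
    x2_stat += (s - m * p_hat) ** 2 / (m * p_hat * (1.0 - p_hat))

df = len(cells) - 3                                      # cells minus estimated parameters
print(f"X^2 = {x2_stat:.3f} on {df} degrees of freedom")
```

Because the estimates are entered to only five decimal places, the computed value agrees with the reported 13.543 up to rounding.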

From the preceding table, we see that the null hypothesis $H_0 : \beta_2 = 0$ is not
rejected. Accordingly, we drop $X_2$ and fit the smaller model given by

$$P(Y = 1 \mid X_3 = x_3) = \frac{\exp\{\beta_1 + \beta_3 x_3\}}{1 + \exp\{\beta_1 + \beta_3 x_3\}}.$$

This leads to the estimates $\hat{\beta}_1 = 5.4152$ and $\hat{\beta}_3 = -0.08070$. Note that these are only
marginally different from the previous estimates. In Figure 10.5.1, we present a graph
of the fitted function over the range where we have observed $X_3$.


[Plot: fitted probability (vertical axis, 0.8 to 1.0) against V (horizontal axis, 10 to 50).]

Figure 10.5.1: The fitted probability of obtaining an ingot ready to be rolled as a function of 
heating time in Example 10.5.1. 
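
The curve in Figure 10.5.1 can be reproduced from the two reported estimates. The following Python sketch evaluates the fitted probability at the observed heating times; the function name `fitted_probability` is our own.

```python
import math

b1_hat, b3_hat = 5.4152, -0.08070   # estimates for the reduced model in Example 10.5.1

def fitted_probability(v):
    """Fitted P(Y = 1 | X3 = v) under the reduced logistic regression model."""
    eta = b1_hat + b3_hat * v
    return math.exp(eta) / (1.0 + math.exp(eta))

for v in (7, 14, 27, 51):           # the heating times observed in the example
    print(v, round(fitted_probability(v), 3))
```

Since the estimated coefficient of V is negative, the fitted probability of an ingot being ready for rolling decreases as the heating time increases.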



Summary of Section 10.5 


• We have examined the situation in which we have a single binary-valued response
variable and a number of quantitative predictors.

• One method of expressing a relationship between the response and predictors is
via the use of a link function.

• If we use the logistic link function, then we can carry out a logistic regression
analysis using likelihood methods of inference.


EXERCISES 


10.5.1 Prove that the function $f : R^1 \to R^1$, defined by $f(x) = e^{-x}(1 + e^{-x})^{-2}$ for
$x \in R^1$, is a density function with distribution function given by $F(x) = (1 + e^{-x})^{-1}$
and inverse cdf given by $F^{-1}(p) = \ln p - \ln(1 - p)$ for $p \in [0, 1]$. This is called the
logistic distribution.

10.5.2 Establish (10.5.2). 


10.5.3 Suppose that a logistic regression model for a binary-valued response Y is given
by

$$P(Y = 1 \mid x) = \frac{\exp\{\beta_1 + \beta_2 x\}}{1 + \exp\{\beta_1 + \beta_2 x\}}.$$

Prove that the log odds at X = x is given by $\beta_1 + \beta_2 x$.

10.5.4 Suppose that instead of the inverse logistic cdf as the link function, we use
the inverse cdf of a Laplace distribution (see Problem 2.4.22). Determine the form of
$P(Y = 1 \mid X_1 = x_1, \ldots, X_k = x_k)$.

10.5.5 Suppose that instead of the inverse logistic cdf as the link function, we use
the inverse cdf of a Cauchy distribution (see Problem 2.4.21). Determine the form of
$P(Y = 1 \mid X_1 = x_1, \ldots, X_k = x_k)$.


COMPUTER EXERCISES 


10.5.6 Use software to replicate the results of Example 10.5.1. 


10.5.7 Suppose that the following data were obtained for the quantitative predictor X 
and the binary-valued response variable Y. 


PY =1|x)= 


=) —3 —1 


Ean = 0,0 7 1,0 a 


(a) Using these data, fit the logistic regression model given by

$$P(Y = 1 \mid x) = \frac{\exp\{\beta_1 + \beta_2 x + \beta_3 x^2\}}{1 + \exp\{\beta_1 + \beta_2 x + \beta_3 x^2\}}.$$


(b) Does the model fit the data? 
(c) Test the null hypothesis Hp : £3 = 0. 




(d) If you decide there is no quadratic effect, refit the model and test for any linear 
effect. 


(e) Plot P(Y = 1 |x) as a function of x. 


PROBLEMS 


10.5.8 Prove that (10.5.3) is the correct form for the chi-squared goodness-of-fit test statistic.


10.6 | Further Proofs (Advanced) 
Proof of Theorem 10.3.1 


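We want to prove that, when \(E(Y \mid X = x) = \beta_1 + \beta_2 x\) and we observe the independent values \((x_1, y_1), \ldots, (x_n, y_n)\) for (X, Y), then the least-squares estimates of \(\beta_1\) and \(\beta_2\) are given by \(b_1 = \bar{y} - b_2 \bar{x}\) and
\[ b_2 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \]
whenever \(\sum_{i=1}^n (x_i - \bar{x})^2 \ne 0\).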


We need an algebraic result that will simplify our calculations. 


Lemma 10.6.1 If \((x_1, y_1), \ldots, (x_n, y_n)\) are such that \(\sum_{i=1}^n (x_i - \bar{x})^2 \ne 0\) and \(q, r \in R^1\), then \(\sum_{i=1}^n (y_i - b_1 - b_2 x_i)(q + r x_i) = 0\).


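PROOF We have
\[ \sum_{i=1}^n (y_i - b_1 - b_2 x_i) = n\bar{y} - n b_1 - n b_2 \bar{x} = n(\bar{y} - \bar{y} + b_2 \bar{x} - b_2 \bar{x}) = 0, \]
which establishes that \(\sum_{i=1}^n (y_i - b_1 - b_2 x_i) q = 0\) for any q. Now using this, and the formulas in Theorem 10.3.1, we obtain
\[ \sum_{i=1}^n (y_i - b_1 - b_2 x_i) x_i = \sum_{i=1}^n (y_i - b_1 - b_2 x_i)(x_i - \bar{x}) = \sum_{i=1}^n (y_i - \bar{y} - b_2 (x_i - \bar{x}))(x_i - \bar{x}) \]
\[ = \sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x}) - b_2 \sum_{i=1}^n (x_i - \bar{x})^2 = 0. \]
This establishes the lemma. ∎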




Returning to the proof of Theorem 10.3.1, we have
\[ \sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)^2 = \sum_{i=1}^n (y_i - b_1 - b_2 x_i - (\beta_1 - b_1) - (\beta_2 - b_2) x_i)^2 \]
\[ = \sum_{i=1}^n (y_i - b_1 - b_2 x_i)^2 - 2 \sum_{i=1}^n (y_i - b_1 - b_2 x_i)\{(\beta_1 - b_1) + (\beta_2 - b_2) x_i\} + \sum_{i=1}^n ((\beta_1 - b_1) + (\beta_2 - b_2) x_i)^2 \]
\[ = \sum_{i=1}^n (y_i - b_1 - b_2 x_i)^2 + \sum_{i=1}^n ((\beta_1 - b_1) + (\beta_2 - b_2) x_i)^2, \]
as the middle term is 0 by Lemma 10.6.1. Therefore,
\[ \sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)^2 \ge \sum_{i=1}^n (y_i - b_1 - b_2 x_i)^2 \]
and \(\sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)^2\) takes its minimum value if and only if
\[ \sum_{i=1}^n ((\beta_1 - b_1) + (\beta_2 - b_2) x_i)^2 = 0. \]
This occurs if and only if \((\beta_1 - b_1) + (\beta_2 - b_2) x_i = 0\) for every i. Because the \(x_i\) are not all the same value, this is true if and only if \(\beta_1 = b_1\) and \(\beta_2 = b_2\), which completes the proof. ∎
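As a quick numerical illustration of Theorem 10.3.1 and Lemma 10.6.1, the following Python sketch computes \(b_1\) and \(b_2\) from the formulas above. The data set and the names `least_squares` and `sum_sq` are hypothetical, chosen only for illustration; they do not come from the text. The sketch then computes the two sums that Lemma 10.6.1 says must vanish.

```python
def least_squares(xs, ys):
    """Return (b1, b2) using the formulas in Theorem 10.3.1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the x values must not all be equal")
    b2 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b1 = y_bar - b2 * x_bar
    return b1, b2


def sum_sq(beta1, beta2, xs, ys):
    """Sum of squared deviations of the y values from the line beta1 + beta2 * x."""
    return sum((y - beta1 - beta2 * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, not taken from the text.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, b2 = least_squares(xs, ys)
residuals = [y - b1 - b2 * x for x, y in zip(xs, ys)]

# Lemma 10.6.1 with (q, r) = (1, 0) and (q, r) = (0, 1):
residual_sum = sum(residuals)
residual_dot_x = sum(r * x for r, x in zip(residuals, xs))
```

Up to floating-point rounding, both `residual_sum` and `residual_dot_x` should be zero, and `sum_sq(b1, b2, xs, ys)` should not exceed `sum_sq` evaluated at any other pair of coefficients.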


Proof of Theorem 10.3.2 


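We want to prove that, if \(E(Y \mid X = x) = \beta_1 + \beta_2 x\) and we observe the independent values \((x_1, y_1), \ldots, (x_n, y_n)\) for (X, Y), then
(i) \(E(B_1 \mid X_1 = x_1, \ldots, X_n = x_n) = \beta_1\),
(ii) \(E(B_2 \mid X_1 = x_1, \ldots, X_n = x_n) = \beta_2\).

From Theorem 10.3.1 and \(E(\bar{Y} \mid X_1 = x_1, \ldots, X_n = x_n) = \beta_1 + \beta_2 \bar{x}\), we have that
\[ E(B_2 \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{\sum_{i=1}^n (x_i - \bar{x})(\beta_1 + \beta_2 x_i - \beta_1 - \beta_2 \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \beta_2 \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} = \beta_2. \]
Also, from Theorem 10.3.1 and what we have just proved,
\[ E(B_1 \mid X_1 = x_1, \ldots, X_n = x_n) = \beta_1 + \beta_2 \bar{x} - \beta_2 \bar{x} = \beta_1. \]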




Proof of Theorem 10.3.3 
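We want to prove that, if \(E(Y \mid X = x) = \beta_1 + \beta_2 x\), \(\mathrm{Var}(Y \mid X = x) = \sigma^2\) for every x, and we observe the independent values \((x_1, y_1), \ldots, (x_n, y_n)\) for (X, Y), then
(i) \(\mathrm{Var}(B_1 \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 \left(1/n + \bar{x}^2 / \sum_{i=1}^n (x_i - \bar{x})^2\right)\),
(ii) \(\mathrm{Var}(B_2 \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 / \sum_{i=1}^n (x_i - \bar{x})^2\),
(iii) \(\mathrm{Cov}(B_1, B_2 \mid X_1 = x_1, \ldots, X_n = x_n) = -\sigma^2 \bar{x} / \sum_{i=1}^n (x_i - \bar{x})^2\).

We first prove (ii). Observe that \(b_2\) is a linear combination of the \(y_i - \bar{y}\) values, so we can evaluate the conditional variance once we have obtained the conditional variances and covariances of the \(Y_i - \bar{Y}\) values. We have that
\[ Y_i - \bar{Y} = \left(1 - \frac{1}{n}\right) Y_i - \frac{1}{n} \sum_{j \ne i} Y_j, \]
so the conditional variance of \(Y_i - \bar{Y}\) is given by
\[ \sigma^2 \left(1 - \frac{1}{n}\right)^2 + \sigma^2 \frac{n - 1}{n^2} = \sigma^2 \left(1 - \frac{1}{n}\right). \]
When \(i \ne j\), we can write
\[ Y_i - \bar{Y} = \left(1 - \frac{1}{n}\right) Y_i - \frac{1}{n} Y_j - \frac{1}{n} \sum_{k \ne i, j} Y_k, \]
and the conditional covariance between \(Y_i - \bar{Y}\) and \(Y_j - \bar{Y}\) is then given by
\[ -2\sigma^2 \left(1 - \frac{1}{n}\right) \frac{1}{n} + \sigma^2 \frac{n - 2}{n^2} = -\frac{\sigma^2}{n} \]
(note that you can assume that the means of the Y's are 0 for this calculation). Therefore, the conditional variance of \(B_2\) is given by
\[ \mathrm{Var}(B_2 \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 \left(1 - \frac{1}{n}\right) \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{\left(\sum_{i=1}^n (x_i - \bar{x})^2\right)^2} - \frac{\sigma^2}{n} \frac{\sum_{i \ne j} (x_i - \bar{x})(x_j - \bar{x})}{\left(\sum_{i=1}^n (x_i - \bar{x})^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \]
because
\[ \sum_{i \ne j} (x_i - \bar{x})(x_j - \bar{x}) = \left(\sum_{i=1}^n (x_i - \bar{x})\right)^2 - \sum_{i=1}^n (x_i - \bar{x})^2 = -\sum_{i=1}^n (x_i - \bar{x})^2. \]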




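For (iii), we have that
\[ \mathrm{Cov}(B_1, B_2 \mid X_1 = x_1, \ldots, X_n = x_n) = \mathrm{Cov}(\bar{Y} - B_2 \bar{x}, B_2 \mid X_1 = x_1, \ldots, X_n = x_n) \]
\[ = \mathrm{Cov}(\bar{Y}, B_2 \mid X_1 = x_1, \ldots, X_n = x_n) - \bar{x}\, \mathrm{Var}(B_2 \mid X_1 = x_1, \ldots, X_n = x_n) \]
and
\[ \mathrm{Cov}(\bar{Y}, B_2 \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{\sum_{i=1}^n (x_i - \bar{x})\, \mathrm{Cov}((Y_i - \bar{Y}), \bar{Y} \mid X_1 = x_1, \ldots, X_n = x_n)}{\sum_{i=1}^n (x_i - \bar{x})^2} = 0, \]
because \(\mathrm{Cov}((Y_i - \bar{Y}), \bar{Y} \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 (1 - 1/n)(1/n) - \sigma^2 (n - 1)/n^2 = 0\) for every i.
Therefore, \(\mathrm{Cov}(B_1, B_2 \mid X_1 = x_1, \ldots, X_n = x_n) = -\sigma^2 \bar{x} / \sum_{i=1}^n (x_i - \bar{x})^2\).
Finally, for (i), we have
\[ \mathrm{Var}(B_1 \mid X_1 = x_1, \ldots, X_n = x_n) = \mathrm{Var}(\bar{Y} - B_2 \bar{x} \mid X_1 = x_1, \ldots, X_n = x_n) \]
\[ = \mathrm{Var}(\bar{Y} \mid X_1 = x_1, \ldots, X_n = x_n) + \bar{x}^2\, \mathrm{Var}(B_2 \mid X_1 = x_1, \ldots, X_n = x_n) - 2\bar{x}\, \mathrm{Cov}(\bar{Y}, B_2 \mid X_1 = x_1, \ldots, X_n = x_n), \]
where \(\mathrm{Var}(\bar{Y} \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2/n\). Substituting the results for (ii) and (iii) completes the proof of the theorem. ∎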
Proof of Corollary 10.3.1 


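We need to show that
\[ \mathrm{Var}(B_1 + B_2 x \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 \left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right). \]
For this, we have that
\[ \mathrm{Var}(B_1 + B_2 x \mid X_1 = x_1, \ldots, X_n = x_n) \]
\[ = \mathrm{Var}(B_1 \mid X_1 = x_1, \ldots, X_n = x_n) + x^2\, \mathrm{Var}(B_2 \mid X_1 = x_1, \ldots, X_n = x_n) + 2x\, \mathrm{Cov}(B_1, B_2 \mid X_1 = x_1, \ldots, X_n = x_n) \]
\[ = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2 + x^2 - 2x\bar{x}}{\sum_{i=1}^n (x_i - \bar{x})^2}\right) = \sigma^2 \left(\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right). \]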
Proof of Theorem 10.3.4 


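We want to show that, if \(E(Y \mid X = x) = \beta_1 + \beta_2 x\), \(\mathrm{Var}(Y \mid X = x) = \sigma^2\) for every x, and we observe the independent values \((x_1, y_1), \ldots, (x_n, y_n)\) for (X, Y), then
\[ E(S^2 \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2. \]
We have that
\[ E((n - 2) S^2 \mid X_1 = x_1, \ldots, X_n = x_n) = E\left(\sum_{i=1}^n (Y_i - B_1 - B_2 x_i)^2 \,\Big|\, X_1 = x_1, \ldots, X_n = x_n\right) \]
\[ = \sum_{i=1}^n E((Y_i - \bar{Y} - B_2 (x_i - \bar{x}))^2 \mid X_1 = x_1, \ldots, X_n = x_n) \]
\[ = \sum_{i=1}^n \mathrm{Var}(Y_i - \bar{Y} - B_2 (x_i - \bar{x}) \mid X_1 = x_1, \ldots, X_n = x_n) \]
because
\[ E(Y_i - \bar{Y} - B_2 (x_i - \bar{x}) \mid X_1 = x_1, \ldots, X_n = x_n) = \beta_1 + \beta_2 x_i - \beta_1 - \beta_2 \bar{x} - \beta_2 (x_i - \bar{x}) = 0. \]
Now,
\[ \mathrm{Var}(Y_i - \bar{Y} - B_2 (x_i - \bar{x}) \mid X_1 = x_1, \ldots, X_n = x_n) = \mathrm{Var}(Y_i - \bar{Y} \mid X_1 = x_1, \ldots, X_n = x_n) \]
\[ - 2 (x_i - \bar{x})\, \mathrm{Cov}((Y_i - \bar{Y}), B_2 \mid X_1 = x_1, \ldots, X_n = x_n) + (x_i - \bar{x})^2\, \mathrm{Var}(B_2 \mid X_1 = x_1, \ldots, X_n = x_n) \]
and, using the results established about the covariances of the \(Y_i - \bar{Y}\) in the proof of Theorem 10.3.3, we have that
\[ \mathrm{Var}(Y_i - \bar{Y} \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 (1 - 1/n) \]
and
\[ \mathrm{Cov}(Y_i - \bar{Y}, B_2 \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{1}{\sum_{j=1}^n (x_j - \bar{x})^2} \sum_{j=1}^n (x_j - \bar{x})\, \mathrm{Cov}(Y_i - \bar{Y}, Y_j - \bar{Y} \mid X_1 = x_1, \ldots, X_n = x_n) \]
\[ = \frac{\sigma^2 (1 - 1/n)(x_i - \bar{x})}{\sum_{j=1}^n (x_j - \bar{x})^2} - \frac{\sigma^2}{n} \frac{\sum_{j \ne i} (x_j - \bar{x})}{\sum_{j=1}^n (x_j - \bar{x})^2} = \frac{\sigma^2 (x_i - \bar{x})}{\sum_{j=1}^n (x_j - \bar{x})^2} \]
because \(\sum_{j \ne i} (x_j - \bar{x}) = -(x_i - \bar{x})\). Therefore,
\[ \mathrm{Var}(Y_i - \bar{Y} - B_2 (x_i - \bar{x}) \mid X_1 = x_1, \ldots, X_n = x_n) = \sigma^2 \left(1 - \frac{1}{n}\right) - \frac{2\sigma^2 (x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2} + \frac{\sigma^2 (x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2} \]
\[ = \sigma^2 \left(1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\right) \]
and
\[ E(S^2 \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{\sigma^2}{n - 2} \sum_{i=1}^n \left(1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}\right) = \frac{\sigma^2}{n - 2}(n - 1 - 1) = \sigma^2, \]
as was stated. ∎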


Proof of Lemma 10.3.1 


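We need to show that, if \((x_1, y_1), \ldots, (x_n, y_n)\) are such that \(\sum_{i=1}^n (x_i - \bar{x})^2 \ne 0\), then
\[ \sum_{i=1}^n (y_i - \bar{y})^2 = b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2 + \sum_{i=1}^n (y_i - b_1 - b_2 x_i)^2. \]
We have that
\[ \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n y_i^2 - n\bar{y}^2 = \sum_{i=1}^n (y_i - b_1 - b_2 x_i + b_1 + b_2 x_i)^2 - n\bar{y}^2 \]
\[ = \sum_{i=1}^n (y_i - b_1 - b_2 x_i)^2 + \sum_{i=1}^n (b_1 + b_2 x_i)^2 - n\bar{y}^2 \]
because \(\sum_{i=1}^n (y_i - b_1 - b_2 x_i)(b_1 + b_2 x_i) = 0\) by Lemma 10.6.1. Then, using Theorem 10.3.1, we have
\[ \sum_{i=1}^n (b_1 + b_2 x_i)^2 - n\bar{y}^2 = \sum_{i=1}^n (\bar{y} + b_2 (x_i - \bar{x}))^2 - n\bar{y}^2 = b_2^2 \sum_{i=1}^n (x_i - \bar{x})^2, \]
and this completes the proof. ∎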


Proof of Theorem 10.3.6 


We want to show that, if Y, given X = x, is distributed \(N(\beta_1 + \beta_2 x, \sigma^2)\) and we observe the independent values \((x_1, y_1), \ldots, (x_n, y_n)\) for (X, Y), then the conditional distributions of \(B_1\), \(B_2\), and \(S^2\), given \(X_1 = x_1, \ldots, X_n = x_n\), are as follows.

(i) \(B_1 \sim N\left(\beta_1, \sigma^2 \left(1/n + \bar{x}^2 / \sum_{i=1}^n (x_i - \bar{x})^2\right)\right)\)

(ii) \(B_2 \sim N\left(\beta_2, \sigma^2 / \sum_{i=1}^n (x_i - \bar{x})^2\right)\)

(iii) \(B_1 + B_2 x \sim N\left(\beta_1 + \beta_2 x, \sigma^2 \left(1/n + (x - \bar{x})^2 / \sum_{i=1}^n (x_i - \bar{x})^2\right)\right)\)

(iv) \((n - 2) S^2 / \sigma^2 \sim \chi^2(n - 2)\) independent of \((B_1, B_2)\)

We first prove (i). Because \(B_1\) can be written as a linear combination of the \(Y_i\), Theorem 4.6.1 implies that the distribution of \(B_1\) must be normal. The result then follows from Theorems 10.3.2 and 10.3.3. A similar proof establishes (ii) and (iii). The proof of (iv) is similar to the proof of Theorem 4.6.6, and we leave this to a further course in statistics. ∎




Proof of Corollary 10.3.2 
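We want to show
(i) \(\dfrac{B_1 - \beta_1}{S \left(1/n + \bar{x}^2 / \sum_{i=1}^n (x_i - \bar{x})^2\right)^{1/2}} \sim t(n - 2)\)
(ii) \((B_2 - \beta_2) \left(\sum_{i=1}^n (x_i - \bar{x})^2\right)^{1/2} / S \sim t(n - 2)\)
(iii) \(\dfrac{B_1 + B_2 x - \beta_1 - \beta_2 x}{S \left(1/n + (x - \bar{x})^2 / \sum_{i=1}^n (x_i - \bar{x})^2\right)^{1/2}} \sim t(n - 2)\)
(iv) If F is defined as in (10.3.8), then \(H_0 : \beta_2 = 0\) is true if and only if \(F \sim F(1, n - 2)\).

We first prove (i). Because \(B_1\) and \(S^2\) are independent,
\[ \frac{B_1 - \beta_1}{\sigma \left(1/n + \bar{x}^2 / \sum_{i=1}^n (x_i - \bar{x})^2\right)^{1/2}} \sim N(0, 1) \]
independent of \((n - 2) S^2 / \sigma^2 \sim \chi^2(n - 2)\). Therefore, applying Definition 4.6.2, we have
\[ \frac{B_1 - \beta_1}{\sigma \left(1/n + \bar{x}^2 / \sum_{i=1}^n (x_i - \bar{x})^2\right)^{1/2} \left((n - 2) S^2 / ((n - 2) \sigma^2)\right)^{1/2}} = \frac{B_1 - \beta_1}{S \left(1/n + \bar{x}^2 / \sum_{i=1}^n (x_i - \bar{x})^2\right)^{1/2}} \sim t(n - 2). \]
For (ii), the proof proceeds just as in the proof of (i).

For (iii), the proof proceeds just as in the proof of (i) and also using Corollary 10.3.1.

We now prove (iv). Taking the square of the ratio in (ii) and applying Theorem 4.6.11 implies
\[ G = \frac{(B_2 - \beta_2)^2 \sum_{i=1}^n (x_i - \bar{x})^2}{S^2} \sim F(1, n - 2). \]
Now observe that F defined by (10.3.8) equals G when \(\beta_2 = 0\). The converse that \(F \sim F(1, n - 2)\) only if \(\beta_2 = 0\) is somewhat harder to prove and we leave this to a further course. ∎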


Chapter 11 


Advanced Topic — 
Stochastic Processes 


CHAPTER OUTLINE 


Section 1 Simple Random Walk 
Section 2 Markov Chains 

Section 3 Markov Chain Monte Carlo 
Section 4 Martingales 

Section 5 Brownian Motion 

Section 6 Poisson Processes 
Section 7 Further Proofs 


In this chapter, we consider stochastic processes, which are processes that proceed randomly in time. That is, rather than consider fixed random variables X, Y, etc., or even sequences of independent and identically distributed (i.i.d.) random variables, we shall instead consider sequences \(X_0, X_1, X_2, \ldots\), where \(X_n\) represents some random quantity at time n. In general, the value \(X_n\) at time n might depend on the quantity \(X_{n-1}\) at time n − 1, or even the values \(X_m\) for other times m < n. Stochastic processes have a different “flavor” from ordinary random variables: because they proceed in time, they seem more “alive.”
We begin with a simple but very interesting case, namely, simple random walk.


11.1 | Simple Random Walk 


Simple random walk can be thought of as a model for repeated gambling. Specifically, suppose you start with $a, and repeatedly make $1 bets. At each bet, you have probability p of winning $1 and probability q of losing $1, where p + q = 1. If \(X_n\) is the amount of money you have at time n (henceforth, your fortune at time n), then \(X_0 = a\), while \(X_1\) could be a + 1 or a − 1, depending on whether you win or lose your first bet. Then \(X_2\) could be a + 2 (if you win your first two bets), or a (if you win once and lose once), or a − 2 (if you lose your first two bets). Continuing in this way, we obtain a whole sequence \(X_0, X_1, X_2, \ldots\) of random values, corresponding to your fortune at times 0, 1, 2, ....

We shall refer to the stochastic process \(\{X_n\}\) as simple random walk. Another way to define this model is to start with random variables \(\{Z_i\}\) that are i.i.d. with \(P(Z_i = 1) = p\) and \(P(Z_i = -1) = 1 - p = q\), where 0 < p < 1. (Here, \(Z_i = 1\) if you win the ith bet, while \(Z_i = -1\) if you lose the ith bet.) We then set \(X_0 = a\), and for \(n \ge 1\) we set
\[ X_n = a + Z_1 + Z_2 + \cdots + Z_n. \]
The following is a specific example of this.


EXAMPLE 11.1.1 

Consider simple random walk with a = 8 and p = 1/3, so you start with $8 and have probability 1/3 of winning each bet. Then the probability that you have $9 after one bet is given by
\[ P(X_1 = 9) = P(8 + Z_1 = 9) = P(Z_1 = 1) = 1/3, \]
as it should be. Also, the probability that you have $7 after one bet is given by
\[ P(X_1 = 7) = P(8 + Z_1 = 7) = P(Z_1 = -1) = 2/3. \]
On the other hand, the probability that you have $10 after two bets is given by
\[ P(X_2 = 10) = P(8 + Z_1 + Z_2 = 10) = P(Z_1 = Z_2 = 1) = (1/3)(1/3) = 1/9. \]
∎


EXAMPLE 11.1.2 
Consider again simple random walk with a = 8 and p = 1/3. Then the probability that you have $7 after three bets is given by
\[ P(X_3 = 7) = P(8 + Z_1 + Z_2 + Z_3 = 7) = P(Z_1 + Z_2 + Z_3 = -1). \]
Now, there are three different ways we could have \(Z_1 + Z_2 + Z_3 = -1\), namely: (a) \(Z_1 = 1\), while \(Z_2 = Z_3 = -1\); (b) \(Z_2 = 1\), while \(Z_1 = Z_3 = -1\); or (c) \(Z_3 = 1\), while \(Z_1 = Z_2 = -1\). Each of these three options has probability (1/3)(2/3)(2/3). Hence,
\[ P(X_3 = 7) = (1/3)(2/3)(2/3) + (1/3)(2/3)(2/3) + (1/3)(2/3)(2/3) = 4/9. \]
∎
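The path-by-path reasoning of Examples 11.1.1 and 11.1.2 can be sketched in Python by enumerating every sequence of wins and losses. This is only an illustrative brute-force check; the function name `fortune_probability` is a hypothetical choice, and exact fractions are used so that no rounding occurs.

```python
from fractions import Fraction
from itertools import product


def fortune_probability(a, p, n, target):
    """P(X_n = target) for simple random walk started at a, by enumerating all paths."""
    q = 1 - p
    total = Fraction(0)
    # Each path is a tuple (Z_1, ..., Z_n) with entries +1 (win) or -1 (loss).
    for steps in product((1, -1), repeat=n):
        if a + sum(steps) == target:
            path_prob = Fraction(1)
            for z in steps:
                path_prob *= p if z == 1 else q
            total += path_prob
    return total


p = Fraction(1, 3)
print(fortune_probability(8, p, 1, 9))   # one bet, fortune 9
print(fortune_probability(8, p, 2, 10))  # two bets, fortune 10
print(fortune_probability(8, p, 3, 7))   # three bets, fortune 7
```

The enumeration visits \(2^n\) paths, so it is practical only for small n, which is why a more systematic approach is needed.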


If the number of bets is much larger than three, then it becomes less and less con- 
venient to compute probabilities in the above manner. A more systematic approach is 
required. We turn to that next. 


11.1.1 | The Distribution of the Fortune 


We first compute the distribution of \(X_n\), i.e., the probability that your fortune \(X_n\) after n bets takes on various values.




Theorem 11.1.1 Let \(\{X_n\}\) be simple random walk as before, and let n be a positive integer. If k is an integer such that \(-n \le k \le n\) and n + k is even, then
\[ P(X_n = a + k) = \binom{n}{\frac{n+k}{2}} p^{(n+k)/2} q^{(n-k)/2}. \]
For all other values of k, we have \(P(X_n = a + k) = 0\). Furthermore, \(E(X_n) = a + n(2p - 1)\).


PROOF | See Section 11.7.1 


This theorem tells us the entire distribution, and expected value, of the fortune \(X_n\) at time n.


EXAMPLE 11.1.3 

Suppose p = 1/3, n = 8, and a = 1. Then \(P(X_n = 6) = 0\) because 6 = 1 + 5, and n + 5 = 13 is not even. Also, \(P(X_n = 13) = 0\) because 13 = 1 + 12 and 12 > n. On the other hand,
\[ P(X_n = 5) = P(X_n = 1 + 4) = \binom{n}{\frac{n+4}{2}} p^{(n+4)/2} q^{(n-4)/2} = \binom{8}{6} (1/3)^6 (2/3)^2 = \frac{8 \cdot 7}{2} (1/3)^6 (2/3)^2 \approx 0.0171. \]
Also, \(E(X_n) = a + n(2p - 1) = 1 + 8(2/3 - 1) = -5/3\). ∎
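The formula of Theorem 11.1.1 translates directly into code. The following is a minimal Python sketch, with hypothetical function names `fortune_pmf` and `expected_fortune`, that evaluates the distribution exactly with fractions for the values used in Example 11.1.3.

```python
from fractions import Fraction
from math import comb


def fortune_pmf(a, p, n, x):
    """P(X_n = x) for simple random walk with X_0 = a, using Theorem 11.1.1."""
    k = x - a
    # The probability is zero unless -n <= k <= n and n + k is even.
    if abs(k) > n or (n + k) % 2 != 0:
        return Fraction(0)
    wins = (n + k) // 2
    losses = n - wins  # equals (n - k) / 2
    return comb(n, wins) * p ** wins * (1 - p) ** losses


def expected_fortune(a, p, n):
    """E(X_n) = a + n(2p - 1)."""
    return a + n * (2 * p - 1)


p = Fraction(1, 3)
prob_5 = fortune_pmf(1, p, 8, 5)
print(prob_5, float(prob_5))
print(fortune_pmf(1, p, 8, 6), fortune_pmf(1, p, 8, 13))
print(expected_fortune(1, p, 8))
```

As a sanity check, summing `fortune_pmf(1, p, 8, x)` over x from −7 to 9 gives exactly 1.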
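Regarding \(E(X_n)\), we immediately obtain the following corollary.

Corollary 11.1.1 If p = 1/2, then \(E(X_n) = a\) for all \(n \ge 0\). If p < 1/2, then \(E(X_n) < a\) for all \(n \ge 1\). If p > 1/2, then \(E(X_n) > a\) for all \(n \ge 1\).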


This corollary has the following interpretation. If p = 1/2, then the game is fair, 
i.e., both you and your opponent have equal chance of winning each bet. Thus, the 
corollary says that for fair games, your expected fortune \(E(X_n)\) will never change
from its initial value, a.

On the other hand, if p < 1/2, then the game is subfair, i.e., your opponent’s 
chances are better than yours. In this case, the corollary says your expected fortune 
will decrease, i.e., be less than its initial value of a. Similarly, if p > 1/2, then the 
game is superfair, and the corollary says your expected fortune will increase, i.e., be 
more than its initial value of a. 

Of course, in a real gambling casino, the game is always subfair (which is how the 
casino makes its profit). Hence, in a real casino, the average amount of money with 
which you leave will always be less than the amount with which you entered! 


EXAMPLE 11.1.4 

Suppose a = 10 and p = 1/8. Then \(E(X_n) = 10 + n(2p - 1) = 10 - 3n/4\). Hence, we always have \(E(X_n) \le 10\), and indeed \(E(X_n) < 0\) if \(n \ge 14\). That is, your expected fortune is never more than your initial value of $10 and in fact is negative after 14 or more bets. ∎




Finally, we note as an aside that it is possible to change your probabilities by chang- 
ing your gambling strategy, as in the following example. Hence, the preceding analysis 
applies only to the strategy of betting just $1 each time. 


EXAMPLE 11.1.5 

Consider the “double ’til you win” gambling strategy, defined as follows. We first bet 
$1. Each time we lose, we double our bet on the succeeding turn. As soon as we win 
once, we stop playing (i.e., bet zero from then on). 

It is easily seen that, with this gambling strategy, we will be up $1 as soon as we 
win a bet (which must happen eventually because p > 0). Hence, with probability 1 
we will gain $1 with this gambling strategy for any positive value of p. 

This is rather surprising, because if 0 < p < 1/2, then the odds in this game are
against us. So it seems that we have “cheated fate,” and indeed we have. On the other 
hand, we may need to lose an arbitrarily large amount of money before we win our $1, 
so “infinite capital” is required to follow this gambling strategy. If only finite capital 
is available, then it is impossible to cheat fate in this manner. For a proof of this, see 
more advanced probability books, e.g., page 64 of A First Look at Rigorous Probability 
Theory, 2nd ed., by J. S. Rosenthal (World Scientific Publishing, Singapore, 2006). ∎
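To make the capital requirement of Example 11.1.5 concrete, here is a small Python sketch of the doubling strategy. It assumes, as in the example, that a winning bet pays an amount equal to the bet; the function name `double_til_you_win` is hypothetical. It shows that the net gain is always $1, while the total amount that must be staked roughly doubles with each additional loss.

```python
def double_til_you_win(losses_before_win):
    """Return (net_gain, total_staked) if the first win follows the given number of losses."""
    bet = 1
    lost_so_far = 0
    for _ in range(losses_before_win):
        lost_so_far += bet  # this bet is lost
        bet *= 2            # double the bet on the succeeding turn
    net_gain = bet - lost_so_far       # the winning bet pays its own size
    total_staked = lost_so_far + bet   # capital needed to place every bet
    return net_gain, total_staked


for k in (0, 1, 5, 10, 20):
    print(k, double_til_you_win(k))
```

After k losses the gambler has lost \(2^k - 1\) dollars and must still be able to bet \(2^k\) dollars, which is why only a gambler with unlimited capital can follow the strategy indefinitely.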


11.1.2 | The Gambler’s Ruin Problem 


The previous subsection considered the distribution and expected value of the fortune \(X_n\) at a fixed time n. Here, we consider the gambler’s ruin problem, which requires the consideration of many different n at once, i.e., considers the time evolution of the process.

Let \(\{X_n\}\) be simple random walk as before, for some initial fortune a and some probability p of winning each bet. Assume a is a positive integer. Furthermore, let c > a be some other integer. The gambler’s ruin question is: If you repeatedly bet $1, then what is the probability that you will reach a fortune of $c before you lose all your money by reaching a fortune $0? In other words, will the random walk hit c before hitting 0? Informally, what is the probability that the gambler gets rich (i.e., has $c) before going broke?

More formally, let
\[ \tau_0 = \min\{n \ge 0 : X_n = 0\}, \qquad \tau_c = \min\{n \ge 0 : X_n = c\} \]
be the first hitting times of 0 and c, respectively. That is, \(\tau_0\) is the first time your fortune reaches 0, while \(\tau_c\) is the first time your fortune reaches c.
The gambler’s ruin question is: What is
\[ P(\tau_c < \tau_0), \]


the probability of hitting c before hitting 0? This question is not so easy to answer, 
because there is no limit to how long it might take until either c or 0 is hit. Hence, it is 
not sufficient to just compute the probabilities after 10 bets, or 20 bets, or 100 bets, or 
even 1,000,000 bets. Fortunately, it is possible to answer this question, as follows. 




Theorem 11.1.2 Let \(\{X_n\}\) be simple random walk, with some initial fortune a and probability p of winning each bet. Assume 0 < a < c. Then the probability \(P(\tau_c < \tau_0)\) of hitting c before 0 is given by
\[ P(\tau_c < \tau_0) = \begin{cases} a/c & p = 1/2 \\[4pt] \dfrac{1 - (q/p)^a}{1 - (q/p)^c} & p \ne 1/2. \end{cases} \]

PROOF See Section 11.7 for the proof. ∎


Consider some applications of this result. 


EXAMPLE 11.1.6 

Suppose you start with $5 (i.e., a = 5) and your goal is to win $10 before going broke (i.e., c = 10). If p = 0.500, then your probability of success is a/c = 0.500. If p = 0.499, then your probability of success is given by
\[ \left(1 - \left(\frac{0.501}{0.499}\right)^{5}\right) \left(1 - \left(\frac{0.501}{0.499}\right)^{10}\right)^{-1}, \]
which is approximately 0.495. If p = 0.501, then your probability of success is given by
\[ \left(1 - \left(\frac{0.499}{0.501}\right)^{5}\right) \left(1 - \left(\frac{0.499}{0.501}\right)^{10}\right)^{-1}, \]
which is approximately 0.505. We thus see that in this case, small changes in p lead to small changes in the probability of winning at gambler’s ruin. ∎


EXAMPLE 11.1.7 

Suppose now that you start with $5000 (i.e., a = 5000) and your goal is to win $10,000 before going broke (i.e., c = 10,000). If p = 0.500, then your probability of success is a/c = 0.500, same as before. On the other hand, if p = 0.499, then your probability of success is given by
\[ \left(1 - \left(\frac{0.501}{0.499}\right)^{5000}\right) \left(1 - \left(\frac{0.501}{0.499}\right)^{10{,}000}\right)^{-1}, \]
which is approximately \(2 \times 10^{-9}\), i.e., two parts in a billion! Finally, if p = 0.501, then your probability of success is given by
\[ \left(1 - \left(\frac{0.499}{0.501}\right)^{5000}\right) \left(1 - \left(\frac{0.499}{0.501}\right)^{10{,}000}\right)^{-1}, \]
which is extremely close to 1. We thus see that in this case, small changes in p lead to extremely large changes in the probability of winning at gambler’s ruin. For example, even a tiny disadvantage on each bet can lead to a very large disadvantage in the long run! The reason for this is that, to get from 5000 to 10,000, many bets must be made, so small changes in p have a huge effect overall. ∎
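The calculations in Examples 11.1.6 and 11.1.7 can be reproduced with a few lines of Python. This is a minimal floating-point sketch of the formula in Theorem 11.1.2; the function name `prob_reach_c_before_0` is a hypothetical choice.

```python
def prob_reach_c_before_0(a, c, p):
    """Gambler's ruin probability of hitting c before 0, starting from a (Theorem 11.1.2)."""
    if not (0 < a < c):
        raise ValueError("need 0 < a < c")
    if not (0 < p < 1):
        raise ValueError("need 0 < p < 1")
    if p == 0.5:
        return a / c
    ratio = (1 - p) / p  # this is q / p
    return (1 - ratio ** a) / (1 - ratio ** c)


for a, c in ((5, 10), (5000, 10000)):
    for p in (0.499, 0.5, 0.501):
        print(a, c, p, prob_reach_c_before_0(a, c, p))
```

For much larger values of c with p < 1/2, the power `ratio ** c` can exceed the floating-point range, so a more careful implementation would be needed there.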


Finally, we note that it is also possible to use the gambler’s ruin result to compute \(P(\tau_0 < \infty)\), the probability that the walk will ever hit 0 (equivalently, that you will ever lose all your money), as follows.

Theorem 11.1.3 Let \(\{X_n\}\) be simple random walk, with initial fortune a > 0 and probability p of winning each bet. Then the probability \(P(\tau_0 < \infty)\) that the walk will ever hit 0 is given by
\[ P(\tau_0 < \infty) = \begin{cases} 1 & p \le 1/2 \\ (q/p)^a & p > 1/2. \end{cases} \]

PROOF See Section 11.7 for the proof. ∎


EXAMPLE 11.1.8 
Suppose a = 2 and p = 2/3. Then the probability that you will eventually lose all your money is given by \((q/p)^2 = ((1/3)/(2/3))^2 = 1/4\). Thus, starting with just $2, we see that 3/4 of the time, you will be able to bet forever without ever losing all your money.

On the other hand, if \(p \le 1/2\), then no matter how large a is, it is certain that you will eventually lose all your money. ∎
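Theorem 11.1.3 is equally short in code. The following Python sketch, with the hypothetical function name `prob_ever_hit_zero`, uses exact fractions and reproduces the value in Example 11.1.8.

```python
from fractions import Fraction


def prob_ever_hit_zero(a, p):
    """P(tau_0 < infinity) for simple random walk with X_0 = a > 0 (Theorem 11.1.3)."""
    if a <= 0:
        raise ValueError("need a > 0")
    if p <= Fraction(1, 2):
        return Fraction(1)      # ruin is certain in a fair or subfair game
    return ((1 - p) / p) ** a   # (q / p)^a when p > 1/2


print(prob_ever_hit_zero(2, Fraction(2, 3)))
print(prob_ever_hit_zero(1000, Fraction(1, 2)))
```

Because \((q/p)^a\) decreases geometrically in a when p > 1/2, a larger initial fortune quickly makes eventual ruin unlikely in a superfair game.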


Summary of Section 11.1 


• A simple random walk is a sequence \(\{X_n\}\) of random variables, with \(X_0 = a\) and \(P(X_{n+1} = X_n + 1) = p = 1 - P(X_{n+1} = X_n - 1)\).

• It follows that \(P(X_n = a + k) = \binom{n}{(n+k)/2} p^{(n+k)/2} q^{(n-k)/2}\) for k = −n, −n + 2, −n + 4, ..., n, and \(E(X_n) = a + n(2p - 1)\).

• If 0 < a < c, then the gambler’s ruin probability of reaching c before 0 is equal to a/c if p = 1/2, otherwise to \((1 - ((1 - p)/p)^a)/(1 - ((1 - p)/p)^c)\).


EXERCISES 


11.1.1 Let \(\{X_n\}\) be simple random walk, with initial fortune a = 12 and probability p = 1/3 of winning each bet. Compute \(P(X_n = x)\) for the following values of n and x.
(a) n = 0, x = 13
(b) n = 1, x = 12
(c) n = 1, x = 13
(d) n = 1, x = 11
(e) n = 1, x = 14
(f) n = 2, x = 12
(g) n = 2, x = 13
(h) n = 2, x = 14
(i) n = 2, x = 15
(j) n = 20, x = 15
(k) n = 20, x = 16
(l) n = 20, x = −18
(m) n = 20, x = 10

11.1.2 Let \(\{X_n\}\) be simple random walk, with initial fortune a = 5 and probability p = 2/5 of winning each bet.
(a) Compute \(P(X_1 = 6, X_2 = 5)\).
(b) Compute \(P(X_1 = 4, X_2 = 5)\).
(c) Compute \(P(X_2 = 5)\).
(d) What is the relationship between the quantities in parts (a), (b), and (c)? Why is this so?

11.1.3 Let \(\{X_n\}\) be simple random walk, with initial fortune a = 7 and probability p = 1/6 of winning each bet.
(a) Compute \(P(X_1 = X_3 = 8)\).
(b) Compute \(P(X_1 = 6, X_3 = 8)\).
(c) Compute \(P(X_3 = 8)\).
(d) What is the relationship between the quantities in parts (a), (b), and (c)? Why is this so?

11.1.4 Suppose a = 1000 and p = 0.49.
(a) Compute \(E(X_n)\) for n = 0, 1, 2, 10, 20, 100, and 1000.
(b) How large does n need to be before \(E(X_n) < 0\)?

11.1.5 Let \(\{X_n\}\) be simple random walk, with initial fortune a and probability p = 0.499 of winning each bet. Compute the gambler’s ruin probability \(P(\tau_c < \tau_0)\) for the following values of a and c. Interpret your results in words.
(a) a = 9, c = 10
(b) a = 90, c = 100
(c) a = 900, c = 1000
(d) a = 9000, c = 10,000
(e) a = 90,000, c = 100,000
(f) a = 900,000, c = 1,000,000

11.1.6 Let \(\{X_n\}\) be simple random walk, with initial fortune a = 10 and probability p of winning each bet. Compute \(P(\tau_0 < \infty)\), where p = 0.4 and also where p = 0.6. Interpret your results in words.

11.1.7 Let \(\{X_n\}\) be simple random walk, with initial fortune a = 5, and probability p = 1/4 of winning each bet.
(a) Compute \(P(X_1 = 6)\).
(b) Compute \(P(X_1 = 4)\).
(c) Compute \(P(X_2 = 7)\).
(d) Compute \(P(X_2 = 7 \mid X_1 = 6)\).
(e) Compute \(P(X_2 = 7 \mid X_1 = 4)\).
(f) Compute \(P(X_1 = 6 \mid X_2 = 7)\).
(g) Explain why the answer to part (f) equals what it equals.




11.1.8 Let \(\{X_n\}\) be simple random walk, with initial fortune a = 1000 and probability p = 2/5 of winning each bet.
(a) Compute \(E(X_1)\).
(b) Compute \(E(X_{10})\).
(c) Compute \(E(X_{100})\).
(d) Compute \(E(X_{1000})\).
(e) Find the smallest value of n such that \(E(X_n) < 0\).

11.1.9 Let \(\{X_n\}\) be simple random walk, with initial fortune a = 100 and probability p = 18/38 of winning each bet (as when betting on Red in roulette).
(a) Compute \(P(X_1 > a)\).
(b) Compute \(P(X_2 > a)\).
(c) Compute \(P(X_3 > a)\).
(d) Guess the value of \(\lim_{n \to \infty} P(X_n > a)\).
(e) Interpret part (d) in plain English.


PROBLEMS 


11.1.10 Suppose you start with $10 and repeatedly bet $2 (instead of $1), having prob- 
ability p of winning each time. Suppose your goal is $100, i.e., you keep on betting 
until you either lose all your money, or reach $100. 

(a) As a function of p, what is the probability that you will reach $100 before losing all 
your money? Be sure to justify your solution. (Hint: You may find yourself dividing 
both 10 and 100 by 2.) 

(b) Suppose p = 0.4. Compute a numerical value for the solution in part (a). 

(c) Compare the probabilities in part (b) with the corresponding probabilities if you bet 
just $1 each time. Which is larger? 

(d) Repeat part (b) for the case where you bet $10 each time. Does the probability of 
success increase or decrease? 


CHALLENGES 


11.1.11 Prove that the formula for the gambler’s ruin probability \(P(\tau_c < \tau_0)\) is a continuous function of p, by proving that it is continuous at p = 1/2. That is, prove that
\[ \lim_{p \to 1/2} \frac{1 - ((1 - p)/p)^a}{1 - ((1 - p)/p)^c} = \frac{a}{c}. \]


DISCUSSION TOPICS 


11.1.12 Suppose you repeatedly play roulette in a real casino, betting the same amount 
each time, continuing forever as long as you have money to bet. Is it certain that you 
will eventually lose all your money? Why or why not? 

11.1.13 In Problem 11.1.10, parts (c) and (d), can you explain intuitively why the 
probabilities change as they do, as we increase the amount we bet each time? 

11.1.14 Suppose you start at a and need to reach c, where c > a > 0. You must keep gambling until you reach either c or 0. Suppose you are playing a subfair game (i.e., p < 1/2), but you can choose how much to bet each time (i.e., you can bet $1, or $2, or more, though of course you cannot bet more than you have). What betting amounts do you think¹ will maximize your probability of success, i.e., maximize \(P(\tau_c < \tau_0)\)? (Hint: The results of Problem 11.1.10 may provide a clue.)


11.2 | Markov Chains 


Intuitively, a Markov chain represents the random motion of some object. We shall write \(X_n\) for the position (or value) of the object at time n. There are then rules that give the probabilities for where the object will jump next.

A Markov chain requires a state space S, which is the set of all places the object 
can go. (For example, perhaps S = {1, 2, 3}, or S = {top, bottom}, or S is the set of all 
positive integers.) 

A Markov chain also requires transition probabilities, which give the probabilities for where the object will jump next. Specifically, for \(i, j \in S\), the number \(p_{ij}\) is the probability that, if the object is at i, it will next jump to j. Thus, the collection \(\{p_{ij} : i, j \in S\}\) of transition probabilities satisfies \(p_{ij} \ge 0\) for all \(i, j \in S\), and
\[ \sum_{j \in S} p_{ij} = 1 \]
for each \(i \in S\).

We also need to consider where the Markov chain starts. Often, we will simply set \(X_0 = s\) for some particular state \(s \in S\). More generally, we could have an initial distribution \(\{\mu_i : i \in S\}\), where \(\mu_i = P(X_0 = i)\). In this case, we need \(\mu_i \ge 0\) for each \(i \in S\), and
\[ \sum_{i \in S} \mu_i = 1. \]

To summarize, here S is the state space of all places the object can go; \(\mu_i\) represents the probability that the object starts at the point i; and \(p_{ij}\) represents the probability that, if the object is at the point i, it will then jump to the point j on the next step. In terms of the sequence of random values \(X_0, X_1, X_2, \ldots\), we then have that
\[ P(X_{n+1} = j \mid X_n = i) = p_{ij} \]
for any positive integer n and any \(i, j \in S\). Note that we also require that this jump probability does not depend on the chain’s previous history. That is, we require
\[ P(X_{n+1} = j \mid X_n = i, X_{n-1} = x_{n-1}, \ldots, X_0 = x_0) = p_{ij} \]
for all n and all \(i, j, x_0, \ldots, x_{n-1} \in S\).


¹For more advanced results about this, see, e.g., Theorem 7.3 of Probability and Measure, 3rd ed., by P. Billingsley (John Wiley & Sons, New York, 1995).




11.2.1 | Examples of Markov Chains 


We present some examples of Markov chains here. 


EXAMPLE 11.2.1 

Let S = {1, 2, 3} consist of just three elements, and define the transition probabilities by \(p_{11} = 0\), \(p_{12} = 1/2\), \(p_{13} = 1/2\), \(p_{21} = 1/3\), \(p_{22} = 1/3\), \(p_{23} = 1/3\), \(p_{31} = 1/4\), \(p_{32} = 1/4\), and \(p_{33} = 1/2\). This means that, for example, if the chain is at the state 3, then it has probability 1/4 of jumping to state 1 on the next jump, probability 1/4 of jumping to state 2 on the next jump, and probability 1/2 of remaining at state 3 on the next jump.

This Markov chain jumps around on the three points {1, 2, 3} in a random and interesting way. For example, if it starts at the point 1, then it might jump to 2 or to 3 (with probability 1/2 each). If it jumps to (say) 3, then on the next step it might jump to 1 or 2 (probability 1/4 each) or 3 (probability 1/2). It continues making such random jumps forever.

Note that we can also write the transition probabilities \(p_{ij}\) in matrix form, as
\[ (p_{ij}) = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/3 & 1/3 & 1/3 \\ 1/4 & 1/4 & 1/2 \end{pmatrix} \]
(so that \(p_{31} = 1/4\), etc.). The matrix \((p_{ij})\) is then called a stochastic matrix. This matrix representation is convenient sometimes. ∎
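The matrix form lends itself to a short simulation. The following Python sketch stores the transition probabilities of Example 11.2.1 as a dictionary of rows, checks the two defining properties of a stochastic matrix, and simulates a path of the chain. The names `is_stochastic` and `simulate`, the dictionary layout, and the choice of random seed are all illustrative assumptions, not part of the text.

```python
import random
from fractions import Fraction as F

# Transition probabilities of Example 11.2.1: P[i][j] is p_ij.
P = {
    1: {1: F(0), 2: F(1, 2), 3: F(1, 2)},
    2: {1: F(1, 3), 2: F(1, 3), 3: F(1, 3)},
    3: {1: F(1, 4), 2: F(1, 4), 3: F(1, 2)},
}


def is_stochastic(P):
    """Check that every entry is nonnegative and every row sums to 1."""
    return all(
        all(prob >= 0 for prob in row.values()) and sum(row.values()) == 1
        for row in P.values()
    )


def simulate(P, start, steps, rng):
    """Return the path X_0, X_1, ..., X_steps of the chain started at `start`."""
    path = [start]
    for _ in range(steps):
        row = P[path[-1]]
        states = list(row)
        weights = [float(row[s]) for s in states]
        path.append(rng.choices(states, weights=weights)[0])
    return path


print(is_stochastic(P))
print(simulate(P, start=1, steps=20, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance makes a simulated path reproducible for a given seed.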


EXAMPLE 11.2.2 
Again, let S = {1, 2, 3}. This time define the transition probabilities \(\{p_{ij}\}\) in matrix form, as
\[ (p_{ij}) = \begin{pmatrix} 1/4 & 1/4 & 1/2 \\ 1/3 & 1/3 & 1/3 \\ 0.01 & 0.01 & 0.98 \end{pmatrix}. \]
This also defines a Markov chain on S. For example, from the state 3, there is probability 0.01 of jumping to state 1, probability 0.01 of jumping to state 2, and probability 0.98 of staying in state 3. ∎


EXAMPLE 11.2.3 
Let S = {bedroom, kitchen, den}. Define the transition probabilities \(\{p_{ij}\}\) in matrix form by
\[ (p_{ij}) = \begin{pmatrix} 1/4 & 1/4 & 1/2 \\ 0 & 0 & 1 \\ 0.01 & 0.01 & 0.98 \end{pmatrix}. \]
This defines a Markov chain on S. For example, from the bedroom, the chain has probability 1/4 of staying in the bedroom, probability 1/4 of jumping to the kitchen, and probability 1/2 of jumping to the den. ∎




EXAMPLE 11.2.4 
This time let S = {1, 2, 3, 4}, and define the transition probabilities \(\{p_{ij}\}\) in matrix form, as
\[ (p_{ij}) = \begin{pmatrix} 0.2 & 0.4 & 0 & 0.4 \\ 0.4 & 0.2 & 0.4 & 0 \\ 0 & 0.4 & 0.2 & 0.4 \\ 0.4 & 0 & 0.4 & 0.2 \end{pmatrix}. \]
This defines a Markov chain on S. For example, from the state 4, it has probability 0.4 of jumping to the state 1, but probability 0 of jumping to the state 2. ∎


EXAMPLE 11.2.5 


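This time, let S = {1, 2, 3, 4, 5, 6, 7}, and define the transition probabilities \(\{p_{ij}\}\) in matrix form, as
\[ (p_{ij}) = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 & 0 & 0 \\ 0 & 1/5 & 4/5 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 \\ 1/10 & 0 & 0 & 0 & 7/10 & 0 & 1/5 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}. \]
This defines a (complicated!) Markov chain on S. ∎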


EXAMPLE 11.2.6 Random Walk on the Circle 

Let \(S = \{0, 1, 2, \ldots, d - 1\}\), and define the transition probabilities by saying that \(p_{ii} = 1/3\) for all \(i \in S\), and also \(p_{ij} = 1/3\) whenever i and j are “next to” each other around the circle. That is, \(p_{ij} = 1/3\) whenever j = i, or j = i + 1, or j = i − 1. Also, \(p_{0,d-1} = p_{d-1,0} = 1/3\). Otherwise, \(p_{ij} = 0\).

If we think of the d elements of S as arranged in a circle, then our object, at each step, either stays where it is, or moves one step clockwise, or moves one step counterclockwise, each with probability 1/3. (Note in particular that it can go around the “corner” by jumping from d − 1 to 0, or from 0 to d − 1, with probability 1/3.) ∎
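The “next to each other around the circle” rule is conveniently expressed with arithmetic modulo d. The following Python sketch builds the transition matrix of Example 11.2.6 as a list of rows; the function name `circle_walk_matrix` is hypothetical, and the sketch assumes \(d \ge 3\) so that each state has two distinct neighbors.

```python
from fractions import Fraction


def circle_walk_matrix(d):
    """Transition matrix of the random walk on the circle {0, 1, ..., d - 1}."""
    if d < 3:
        raise ValueError("this sketch assumes d >= 3")
    third = Fraction(1, 3)
    P = [[Fraction(0)] * d for _ in range(d)]
    for i in range(d):
        # Stay put, step to i + 1, or step to i - 1, wrapping around the corner.
        for j in (i, (i + 1) % d, (i - 1) % d):
            P[i][j] = third
    return P


P = circle_walk_matrix(5)
for row in P:
    print([str(entry) for entry in row])
```

The entries `P[0][d - 1]` and `P[d - 1][0]` come out as 1/3 without any special case, because \((0 - 1) \bmod d = d - 1\) and \((d - 1 + 1) \bmod d = 0\).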


EXAMPLE 11.2.7 Ehrenfest’s Urn 

Consider two urns, urn #1 and urn #2, where d balls are divided between the two urns. Suppose at each step, we choose one ball uniformly at random from among the d balls and switch it to the opposite urn. We let \(X_n\) be the number of balls in urn #1 at time n. Thus, there are \(d - X_n\) balls in urn #2 at time n.

Here, the state space is \(S = \{0, 1, 2, \ldots, d\}\) because these are all the possible numbers of balls in urn #1 at any time n.

Also, if there are i balls in urn #1 at some time, then there is probability i/d that we next choose one of those i balls, in which case the number of balls in urn #1 goes down to i − 1. Hence,
\[ p_{i,i-1} = i/d. \]
Similarly,
\[ p_{i,i+1} = (d - i)/d \]
because there is probability (d − i)/d that we will instead choose one of the d − i balls in urn #2. Thus, this Markov chain moves randomly among the possible numbers {0, 1, ..., d} of balls in urn #1 at each time.

One might expect that, if d is large and the Markov chain is run for a long time, there would most likely be approximately d/2 balls in urn #1. (We shall consider such questions in Section 11.2.4.) ∎
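The two transition rules of Ehrenfest's urn determine the whole matrix. As a minimal Python sketch (the function name `ehrenfest_matrix` is a hypothetical choice), the matrix for d balls can be built as follows, with rows and columns indexed by the number of balls in urn #1.

```python
from fractions import Fraction


def ehrenfest_matrix(d):
    """Transition matrix of Ehrenfest's urn on the state space {0, 1, ..., d}."""
    P = [[Fraction(0)] * (d + 1) for _ in range(d + 1)]
    for i in range(d + 1):
        if i > 0:
            P[i][i - 1] = Fraction(i, d)      # a ball from urn #1 is chosen
        if i < d:
            P[i][i + 1] = Fraction(d - i, d)  # a ball from urn #2 is chosen
    return P


P = ehrenfest_matrix(4)
for row in P:
    print([str(entry) for entry in row])
```

Each row sums to i/d + (d − i)/d = 1, and from the extreme states 0 and d the chain moves inward with probability 1.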


The above examples should convince you that Markov chains on finite state spaces 
come in all shapes and sizes. Markov chains on infinite state spaces are also important. 
Indeed, we have already seen one such class of Markov chains. 


EXAMPLE 11.2.8 Simple Random Walk 
Let $S = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ be the set of all integers. Then $S$ is infinite, so we
cannot write the transition probabilities $\{p_{ij}\}$ in matrix form.

Fix $a \in S$, and let $X_0 = a$. Fix a real number $p$ with $0 < p < 1$, and let $p_{i,i+1} = p$
and $p_{i,i-1} = 1 - p$ for each $i \in \mathbb{Z}$, with $p_{ij} = 0$ if $j \ne i \pm 1$. Thus, this Markov
chain begins at the point $a$ (with probability 1) and at each step either increases by 1
(with probability $p$) or decreases by 1 (with probability $1 - p$). It is easily seen that
this Markov chain corresponds precisely to the random walk (i.e., repeated gambling)
model of Section 11.1.2. ■


Finally, we note that in a group, you can create your own Markov chain, as follows 
(try it — it’s fun!). 


EXAMPLE 11.2.9 

Form a group of between 5 and 50 people. Each group member should secretly pick 
out two other people from the group, an “A person” and “B person.” Also, each group 
member should have a coin. 

Take any object, such as a ball, or a pen, or a stuffed frog. Give the object to one 
group member to start. This person should then immediately flip the coin. If the coin 
comes up heads, the group member gives (or throws!) the object to his or her A person. 
If it comes up tails, the object goes to his or her B person. The person receiving the 
object should then immediately flip the coin and continue the process. (Saying your 
name when you receive the object is a great way for everyone to meet each other!) 

Continue this process for a large number of turns. What patterns do you observe? 
Does everyone eventually receive the object? With what frequency? How long does it 
take the object to return to where it started? Make as many interesting observations as 
you can; some of them will be related to the topics that follow. ■


11.2.2 | Computing with Markov Chains 


Suppose a Markov chain $\{X_n\}$ has transition probabilities $\{p_{ij}\}$ and initial distribution
$\{\mu_i\}$. Then $P(X_0 = i) = \mu_i$ for all states $i$. What about $P(X_1 = i)$? We have the
following result.


Chapter 11: Advanced Topic — Stochastic Processes 627 


Theorem 11.2.1 Consider a Markov chain $\{X_n\}$ with state space $S$, transition probabilities
$\{p_{ij}\}$, and initial distribution $\{\mu_i\}$. Then for any $i \in S$,

$$P(X_1 = i) = \sum_{k \in S} \mu_k p_{ki}.$$


PROOF From the law of total probability,

$$P(X_1 = i) = \sum_{k \in S} P(X_0 = k, X_1 = i).$$

But $P(X_0 = k, X_1 = i) = P(X_0 = k) \, P(X_1 = i \mid X_0 = k) = \mu_k p_{ki}$ and the result
follows. ■


Consider an example of this. 


EXAMPLE 11.2.10 
Again, let $S = \{1, 2, 3\}$, and

$$(p_{ij}) = \begin{pmatrix} 1/4 & 1/4 & 1/2 \\ 1/3 & 1/3 & 1/3 \\ 0.01 & 0.01 & 0.98 \end{pmatrix}.$$

Suppose that $P(X_0 = 1) = 1/7$, $P(X_0 = 2) = 2/7$, and $P(X_0 = 3) = 4/7$. Then

$$P(X_1 = 3) = \sum_{k \in S} \mu_k p_{k3} = (1/7)(1/2) + (2/7)(1/3) + (4/7)(0.98) \approx 0.73.$$

Thus, about 73% of the time, this chain will be in state 3 after one step. ■
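The one-step formula of Theorem 11.2.1 is a sum over the starting state, which is easy to evaluate mechanically. The sketch below is illustrative only: the helper `step` and the use of exact fractions are assumptions, not part of the text. It recomputes the distribution of $X_1$ for this example.

```python
from fractions import Fraction as F

def step(v, A):
    """One step of the chain: entry i of the result is sum_k v[k] * A[k][i]."""
    n = len(A)
    return [sum(v[k] * A[k][i] for k in range(n)) for i in range(n)]

A = [[F(1, 4), F(1, 4), F(1, 2)],
     [F(1, 3), F(1, 3), F(1, 3)],
     [F(1, 100), F(1, 100), F(98, 100)]]
v0 = [F(1, 7), F(2, 7), F(4, 7)]  # initial distribution (mu_1, mu_2, mu_3)

v1 = step(v0, A)
print(v1[2], float(v1[2]))  # exact and decimal value of P(X_1 = 3)
```

The exact value is 109/150, which rounds to the 0.73 quoted in the example.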
To proceed, let us write

$$P_i(A) = P(A \mid X_0 = i)$$

for the probability of the event $A$, assuming that the chain starts in the state $i$, that is,
assuming that $\mu_i = 1$ and $\mu_j = 0$ for $j \ne i$. We then see that $P_i(X_n = j)$ is the
probability that, if the chain starts in state $i$ and is run for $n$ steps, it will end up in state
$j$. Can we compute this?

For $n = 0$, we must have $X_0 = i$. Hence, $P_i(X_0 = j) = 1$ if $i = j$, while
$P_i(X_0 = j) = 0$ if $i \ne j$.

For $n = 1$, we see that $P_i(X_1 = j) = p_{ij}$. That is, the probability that we will be
at the state $j$ after one step is given by the transition probability $p_{ij}$.

What about for $n = 2$? If we start at $i$ and end up at $j$ after 2 steps, then we have
to be at some state after 1 step. Let $k$ be this state. Then we see the following.


Theorem 11.2.2 We have $P_i(X_1 = k, X_2 = j) = p_{ik} p_{kj}$.


PROOF If we start at $i$, then the probability of jumping first to $k$ is equal to $p_{ik}$.
Given that we have jumped first to $k$, the probability of then jumping to $j$ is given by
$p_{kj}$. Hence,

$$\begin{aligned}
P_i(X_1 = k, X_2 = j) &= P(X_1 = k, X_2 = j \mid X_0 = i) \\
&= P(X_1 = k \mid X_0 = i) \, P(X_2 = j \mid X_1 = k, X_0 = i) \\
&= p_{ik} p_{kj}.
\end{aligned}$$
■


Using this, we obtain the following.


Theorem 11.2.3 We have $P_i(X_2 = j) = \sum_{k \in S} p_{ik} p_{kj}$.


PROOF By the law of total probability,

$$P_i(X_2 = j) = \sum_{k \in S} P_i(X_1 = k, X_2 = j),$$

so the result follows from Theorem 11.2.2. ■


EXAMPLE 11.2.11 
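Consider again the chain of Example 11.2.1, with $S = \{1, 2, 3\}$ and

$$(p_{ij}) = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/3 & 1/3 & 1/3 \\ 1/4 & 1/4 & 1/2 \end{pmatrix}.$$

Then

$$\begin{aligned}
P_1(X_2 = 3) &= \sum_{k \in S} p_{1k} p_{k3} = p_{11} p_{13} + p_{12} p_{23} + p_{13} p_{33} \\
&= (0)(1/2) + (1/2)(1/3) + (1/2)(1/2) = 1/6 + 1/4 = 5/12.
\end{aligned}$$
■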


By induction (see Problem 11.2.18), we obtain the following. 


Theorem 11.2.4 We have

$$P_i(X_n = j) = \sum_{i_1, i_2, \ldots, i_{n-1} \in S} p_{i i_1} \, p_{i_1 i_2} \, p_{i_2 i_3} \cdots p_{i_{n-2} i_{n-1}} \, p_{i_{n-1} j}.$$


PROOF See Problem 11.2.18. ■


Theorem 11.2.4 thus gives a complete formula for the probability, starting at a
state $i$ at time 0, that the chain will be at some other state $j$ at time $n$. We see from
Theorem 11.2.4 that, once we know the transition probabilities $p_{ij}$ for all $i, j \in S$,
then we can compute the values of $P_i(X_n = j)$ for all $i, j \in S$, and all positive
integers $n$. (The computations get pretty messy, though!) The quantities $P_i(X_n = j)$
are sometimes called the higher-order transition probabilities.

Consider an application of this. 


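EXAMPLE 11.2.12
Consider once again the chain with $S = \{1, 2, 3\}$ and

$$(p_{ij}) = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/3 & 1/3 & 1/3 \\ 1/4 & 1/4 & 1/2 \end{pmatrix}.$$

Then

$$\begin{aligned}
P_1(X_3 = 3) &= \sum_{k \in S} \sum_{\ell \in S} p_{1k} \, p_{k\ell} \, p_{\ell 3} \\
&= p_{11} p_{11} p_{13} + p_{11} p_{12} p_{23} + p_{11} p_{13} p_{33} + p_{12} p_{21} p_{13} + p_{12} p_{22} p_{23} + p_{12} p_{23} p_{33} \\
&\quad + p_{13} p_{31} p_{13} + p_{13} p_{32} p_{23} + p_{13} p_{33} p_{33} \\
&= (0)(0)(1/2) + (0)(1/2)(1/3) + (0)(1/2)(1/2) + (1/2)(1/3)(1/2) \\
&\quad + (1/2)(1/3)(1/3) + (1/2)(1/3)(1/2) + (1/2)(1/4)(1/2) \\
&\quad + (1/2)(1/4)(1/3) + (1/2)(1/2)(1/2) \\
&= 1/12 + 1/18 + 1/12 + 1/16 + 1/24 + 1/8 = 65/144.
\end{aligned}$$
■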


Finally, we note that if we write $A$ for the matrix $(p_{ij})$, write $v_0$ for the row vector
$(\mu_i) = (P(X_0 = i))$, and write $v_1$ for the row vector $(P(X_1 = i))$, then Theorem 11.2.1
can be written succinctly using matrix multiplication as $v_1 = v_0 A$. That
is, the (row) vector of probabilities for the chain after one step $v_1$ is equal to the (row)
vector of probabilities for the chain after zero steps $v_0$, multiplied by the matrix $A$ of
transition probabilities. In fact, if we write $v_n$ for the row vector $(P(X_n = i))$, then
proceeding by induction, we see that $v_{n+1} = v_n A$ for each $n$. Therefore, $v_n = v_0 A^n$,
where $A^n$ is the $n$th power of the matrix $A$. In this context, Theorem 11.2.4 has a
particularly nice interpretation. It says that $P_i(X_n = j)$ is equal to the $(i, j)$ entry of the
matrix $A^n$, i.e., the $n$th power of the matrix $A$.
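The matrix-power interpretation gives a direct way to evaluate higher-order transition probabilities. Below is a small Python sketch under stated assumptions: the helpers `mat_mul` and `mat_pow` are illustrative names, and states 1, 2, 3 are stored at indices 0, 1, 2. It computes entries of $A^2$ and $A^3$ for the chain of Examples 11.2.11 and 11.2.12.

```python
from fractions import Fraction as F

def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(A, n):
    """n-th power of A by repeated multiplication (n >= 0)."""
    size = len(A)
    result = [[F(int(i == j)) for j in range(size)] for i in range(size)]  # identity
    for _ in range(n):
        result = mat_mul(result, A)
    return result

A = [[F(0), F(1, 2), F(1, 2)],
     [F(1, 3), F(1, 3), F(1, 3)],
     [F(1, 4), F(1, 4), F(1, 2)]]

p2 = mat_pow(A, 2)[0][2]  # P_1(X_2 = 3): row of state 1, column of state 3
p3 = mat_pow(A, 3)[0][2]  # P_1(X_3 = 3)
print(p2, p3)
```

Repeated multiplication is the simplest choice here; for large $n$ one would typically use repeated squaring or a numerical library instead.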


11.2.3 | Stationary Distributions 
Suppose we have Markov chain transition probabilities $\{p_{ij}\}$ on a state space $S$. Let
$\{\pi_i : i \in S\}$ be a probability distribution on $S$, so that $\pi_i \ge 0$ for all $i$, and
$\sum_{i \in S} \pi_i = 1$. We have the following definition.


Definition 11.2.1 The distribution $\{\pi_i : i \in S\}$ is stationary for a Markov chain
with transition probabilities $\{p_{ij}\}$ on a state space $S$, if $\sum_{i \in S} \pi_i p_{ij} = \pi_j$ for all
$j \in S$.


The reason for the terminology “stationary” is that, if the chain begins with those
probabilities, then it will always have those same probabilities, as the following theorem
and corollary show.


Theorem 11.2.5 Suppose $\{\pi_i : i \in S\}$ is a stationary distribution for a Markov
chain with transition probabilities $\{p_{ij}\}$ on a state space $S$. Suppose that for some
integer $n$, we have $P(X_n = i) = \pi_i$ for all $i \in S$. Then we also have
$P(X_{n+1} = i) = \pi_i$ for all $i \in S$.


PROOF If $\{\pi_i\}$ is stationary, then we compute that

$$\begin{aligned}
P(X_{n+1} = j) &= \sum_{i \in S} P(X_n = i, X_{n+1} = j) \\
&= \sum_{i \in S} P(X_n = i) \, P(X_{n+1} = j \mid X_n = i) = \sum_{i \in S} \pi_i p_{ij} = \pi_j.
\end{aligned}$$
■


By induction, we obtain the following corollary. 


Corollary 11.2.1 Suppose $\{\pi_i : i \in S\}$ is a stationary distribution for a Markov
chain with transition probabilities $\{p_{ij}\}$ on a state space $S$. Suppose that for some
integer $n$, we have $P(X_n = i) = \pi_i$ for all $i \in S$. Then we also have $P(X_m = i) = \pi_i$
for all $i \in S$ and all integers $m > n$.


The above theorem and corollary say that, once a Markov chain is in its stationary 
distribution, it will remain in its stationary distribution forevermore. 


EXAMPLE 11.2.13 
Consider the Markov chain with $S = \{1, 2, 3\}$, and

$$(p_{ij}) = \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/2 & 1/4 & 1/4 \\ 1/2 & 1/4 & 1/4 \end{pmatrix}.$$

No matter where this Markov chain is, it always jumps with the same probabilities,
i.e., to state 1 with probability 1/2, to state 2 with probability 1/4, or to state 3 with
probability 1/4.

Indeed, if we set $\pi_1 = 1/2$, $\pi_2 = 1/4$, and $\pi_3 = 1/4$, then we see that $p_{ij} = \pi_j$
for all $i, j \in S$. Hence,

$$\sum_{i \in S} \pi_i p_{ij} = \sum_{i \in S} \pi_i \pi_j = \pi_j \sum_{i \in S} \pi_i = \pi_j (1) = \pi_j.$$

Thus, $\{\pi_i\}$ is a stationary distribution. Hence, once in the distribution $\{\pi_i\}$, the chain
will stay in the distribution $\{\pi_i\}$ forever. ■


EXAMPLE 11.2.14 
Consider a Markov chain with $S = \{0, 1\}$ and

$$(p_{ij}) = \begin{pmatrix} 0.1 & 0.9 \\ 0.6 & 0.4 \end{pmatrix}.$$

If this chain had a stationary distribution $\{\pi_i\}$, then we must have that

$$\begin{aligned}
\pi_0 (0.1) + \pi_1 (0.6) &= \pi_0, \\
\pi_0 (0.9) + \pi_1 (0.4) &= \pi_1.
\end{aligned}$$

The first equation gives $\pi_1 (0.6) = \pi_0 (0.9)$, so $\pi_1 = (3/2) \pi_0$. This is also consistent
with the second equation. In addition, we require that $\pi_0 + \pi_1 = 1$, i.e., that
$\pi_0 + (3/2) \pi_0 = 1$, so that $\pi_0 = 2/5$. Then $\pi_1 = (3/2)(2/5) = 3/5$.

We then check that the settings $\pi_0 = 2/5$ and $\pi_1 = 3/5$ satisfy the above equations.
Hence, $\{\pi_i\}$ is indeed a stationary distribution for this Markov chain. ■


EXAMPLE 11.2.15 
Consider next the Markov chain with $S = \{1, 2, 3\}$, and

$$(p_{ij}) = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 1/2 & 0 & 1/2 \\ 1/2 & 1/2 & 0 \end{pmatrix}.$$

We see that this Markov chain has the property that, in addition to having
$\sum_{j \in S} p_{ij} = 1$ for all $i$, it also has $\sum_{i \in S} p_{ij} = 1$ for all $j$. That is, not only do the rows of the
matrix $(p_{ij})$ sum to 1, but so do the columns. (Such a matrix is sometimes called
doubly stochastic.)

Let $\pi_1 = \pi_2 = \pi_3 = 1/3$, so that $\{\pi_i\}$ is the uniform distribution on $S$. Then we
compute that

$$\sum_{i \in S} \pi_i p_{ij} = \sum_{i \in S} (1/3) p_{ij} = (1/3) \sum_{i \in S} p_{ij} = (1/3)(1) = \pi_j.$$

Because this is true for all $j$, we see that $\{\pi_i\}$ is a stationary distribution for this Markov
chain. ■


EXAMPLE 11.2.16 
Consider the Markov chain with $S = \{1, 2, 3\}$, and

$$(p_{ij}) = \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/3 & 1/3 & 1/3 \\ 0 & 1/4 & 3/4 \end{pmatrix}.$$

Does this Markov chain have a stationary distribution?

Well, if it had a stationary distribution $\{\pi_i\}$, then the following equations would
have to be satisfied:

$$\begin{aligned}
\pi_1 &= (1/2) \pi_1 + (1/3) \pi_2 + (0) \pi_3, \\
\pi_2 &= (1/4) \pi_1 + (1/3) \pi_2 + (1/4) \pi_3, \\
\pi_3 &= (1/4) \pi_1 + (1/3) \pi_2 + (3/4) \pi_3.
\end{aligned}$$

The first equation gives $\pi_1 = (2/3) \pi_2$. The second equation then gives

$$(1/4) \pi_3 = \pi_2 - (1/4) \pi_1 - (1/3) \pi_2 = \pi_2 - (1/4)(2/3) \pi_2 - (1/3) \pi_2 = (1/2) \pi_2,$$

so that $\pi_3 = 2 \pi_2$.

But we also require $\pi_1 + \pi_2 + \pi_3 = 1$, i.e., $(2/3) \pi_2 + \pi_2 + 2 \pi_2 = 1$, so that
$\pi_2 = 3/11$. Then $\pi_1 = 2/11$, and $\pi_3 = 6/11$.

It is then easily checked that the distribution given by $\pi_1 = 2/11$, $\pi_2 = 3/11$, and
$\pi_3 = 6/11$ satisfies the preceding equations, so it is indeed a stationary distribution for
this Markov chain. ■
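The hand calculation above amounts to solving a small linear system, so it can be automated. The following is a sketch, not a general-purpose solver: the function name `stationary` is illustrative, and it assumes the chain has exactly one stationary distribution. It drops one (redundant) balance equation, replaces it with the normalization $\sum_i \pi_i = 1$, and applies Gauss-Jordan elimination in exact arithmetic.

```python
from fractions import Fraction as F

def stationary(A):
    """Solve pi A = pi with sum(pi) = 1, assuming the solution is unique."""
    n = len(A)
    # Balance equations for j = 0..n-2: sum_i pi_i * (A[i][j] - [i == j]) = 0.
    M = [[A[i][j] - (1 if i == j else 0) for i in range(n)] + [F(0)]
         for j in range(n - 1)]
    M.append([F(1)] * n + [F(1)])  # normalization: pi_0 + ... + pi_{n-1} = 1
    for col in range(n):
        # Raises StopIteration if no pivot exists (solution not unique).
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

A = [[F(1, 2), F(1, 4), F(1, 4)],
     [F(1, 3), F(1, 3), F(1, 3)],
     [F(0), F(1, 4), F(3, 4)]]
pi = stationary(A)
print(pi)
```

Applied to the two-state chain of Example 11.2.14, the same function returns $(2/5, 3/5)$.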


EXAMPLE 11.2.17
Consider again random walk on the circle, as in Example 11.2.6. We observe that for
any state $j$, there are precisely three states $i$ (namely, the state $i = j$, the state one
clockwise from $j$, and the state one counterclockwise from $j$) with $p_{ij} = 1/3$. Hence,
$\sum_{i \in S} p_{ij} = 1$. That is, the transition matrix $(p_{ij})$ is again doubly stochastic.

It then follows, just as in Example 11.2.15, that the uniform distribution, given by
$\pi_i = 1/d$ for $i = 0, 1, \ldots, d-1$, is a stationary distribution for this Markov chain. ■


EXAMPLE 11.2.18 

For Ehrenfest’s urn (see Example 11.2.7), it is not obvious what might be a stationary
distribution. However, a possible solution emerges by thinking about each ball individually.
Indeed, any given ball usually stays still but occasionally gets flipped from one
urn to the other. So it seems reasonable that in stationarity, it should be equally likely
to be in either urn, i.e., have probability 1/2 of being in urn #1.

If this is so, then the total number of balls in urn #1 would have the distribution
Binomial$(d, 1/2)$, since there would be $d$ balls, each having probability 1/2 of being
in urn #1.

To test this, we set $\pi_i = \binom{d}{i} / 2^d$, for $i = 0, 1, \ldots, d$. We then compute that if
$1 \le j \le d - 1$, then

$$\begin{aligned}
\sum_{i \in S} \pi_i p_{ij} &= \pi_{j-1} \, p_{j-1,j} + \pi_{j+1} \, p_{j+1,j} \\
&= \binom{d}{j-1} \frac{1}{2^d} \, \frac{d - (j-1)}{d} + \binom{d}{j+1} \frac{1}{2^d} \, \frac{j+1}{d} \\
&= \binom{d-1}{j-1} \frac{1}{2^d} + \binom{d-1}{j} \frac{1}{2^d}.
\end{aligned}$$

Next, we use the identity known as Pascal’s triangle, which says that

$$\binom{d-1}{j-1} + \binom{d-1}{j} = \binom{d}{j}.$$

Hence, we conclude that

$$\sum_{i \in S} \pi_i p_{ij} = \binom{d}{j} \frac{1}{2^d} = \pi_j.$$

With minor modifications (see Problem 11.2.19), the preceding argument works for
$j = 0$ and $j = d$ as well. We therefore conclude that $\sum_{i \in S} \pi_i p_{ij} = \pi_j$, for all $j \in S$.
Hence, $\{\pi_i\}$ is a stationary distribution. ■
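The stationarity of the binomial weights for Ehrenfest’s urn can also be checked numerically for any particular $d$. The sketch below is an illustration only; the function name `ehrenfest_matrix` and the choice $d = 6$ are assumptions made for the example.

```python
from fractions import Fraction as F
from math import comb

def ehrenfest_matrix(d):
    """Transition matrix on states 0..d: p[i][i-1] = i/d and p[i][i+1] = (d-i)/d."""
    P = [[F(0)] * (d + 1) for _ in range(d + 1)]
    for i in range(d + 1):
        if i > 0:
            P[i][i - 1] = F(i, d)
        if i < d:
            P[i][i + 1] = F(d - i, d)
    return P

d = 6
P = ehrenfest_matrix(d)
pi = [F(comb(d, i), 2 ** d) for i in range(d + 1)]  # Binomial(d, 1/2) probabilities
pi_next = [sum(pi[i] * P[i][j] for i in range(d + 1)) for j in range(d + 1)]
print(pi_next == pi)  # stationarity: one step leaves the distribution unchanged
```

The comparison covers the boundary states $j = 0$ and $j = d$ as well as the interior ones.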


One easy way to check for stationarity is the following. 


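Definition 11.2.2 A Markov chain is said to be reversible with respect to a distribution
$\{\pi_i\}$ if, for all $i, j \in S$, we have $\pi_i p_{ij} = \pi_j p_{ji}$.


Theorem 11.2.6 If a Markov chain is reversible with respect to $\{\pi_i\}$, then $\{\pi_i\}$ is
a stationary distribution for the chain.


PROOF We compute, using reversibility, that for any $j \in S$,

$$\sum_{i \in S} \pi_i p_{ij} = \sum_{i \in S} \pi_j p_{ji} = \pi_j \sum_{i \in S} p_{ji} = \pi_j (1) = \pi_j.$$

Hence, $\{\pi_i\}$ is a stationary distribution. ■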


EXAMPLE 11.2.19 
Suppose $S = \{1, 2, 3, 4, 5\}$, and the transition probabilities are given by

$$(p_{ij}) = \begin{pmatrix} 1/3 & 2/3 & 0 & 0 & 0 \\ 1/3 & 0 & 2/3 & 0 & 0 \\ 0 & 1/3 & 0 & 2/3 & 0 \\ 0 & 0 & 1/3 & 0 & 2/3 \\ 0 & 0 & 0 & 1/3 & 2/3 \end{pmatrix}.$$

It is not immediately clear what stationary distribution this chain may possess.
Furthermore, to compute directly as in Example 11.2.16 would be quite messy.

On the other hand, we observe that for $1 \le i \le 4$, we always have
$p_{i,i+1} = 2 p_{i+1,i}$. Hence, if we set $\pi_i = C 2^i$ for some $C > 0$, then we will have

$$\pi_i p_{i,i+1} = C 2^i p_{i,i+1} = C 2^i \, 2 p_{i+1,i},$$

while

$$\pi_{i+1} p_{i+1,i} = C 2^{i+1} p_{i+1,i}.$$

Hence, $\pi_i p_{i,i+1} = \pi_{i+1} p_{i+1,i}$ for each $i$.

Furthermore, $p_{ij} = 0$ if $i$ and $j$ differ by at least 2. It follows that $\pi_i p_{ij} = \pi_j p_{ji}$
for each $i, j \in S$. Hence, the chain is reversible with respect to $\{\pi_i\}$, and so $\{\pi_i\}$ is a
stationary distribution for the chain.

Finally, we solve for $C$. We need $\sum_{i \in S} \pi_i = 1$. Hence, we must have
$C = 1 / \sum_{i \in S} 2^i = 1 / \sum_{i=1}^{5} 2^i = 1/62$. Thus, $\pi_i = 2^i / 62$, for $i \in S$. ■
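Reversibility is a condition on pairs of states, so it can be verified by a direct loop. Here is a minimal sketch for this chain; the helper name `is_reversible` is illustrative, and the weights $2^i$ are normalized by their own sum rather than by a hard-coded constant.

```python
from fractions import Fraction as F

def is_reversible(pi, P):
    """Check pi_i * p_ij == pi_j * p_ji for every pair of states."""
    n = len(P)
    return all(pi[i] * P[i][j] == pi[j] * P[j][i]
               for i in range(n) for j in range(n))

a, b = F(1, 3), F(2, 3)
P = [[a, b, 0, 0, 0],
     [a, 0, b, 0, 0],
     [0, a, 0, b, 0],
     [0, 0, a, 0, b],
     [0, 0, 0, a, b]]

weights = [2 ** i for i in range(1, 6)]       # 2^i for states i = 1, ..., 5
pi = [F(w, sum(weights)) for w in weights]    # normalize so that sum(pi) == 1
stationary_ok = all(sum(pi[i] * P[i][j] for i in range(5)) == pi[j]
                    for j in range(5))
print(is_reversible(pi, P), stationary_ok)
```

The second check confirms, for this one chain, that reversibility does deliver stationarity.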


11.2.4 | Markov Chain Limit Theorem 


Suppose now that $\{X_n\}$ is a Markov chain, which has a stationary distribution $\{\pi_i\}$. We
have already seen that, if $P(X_n = i) = \pi_i$ for all $i$ for some $n$, then also $P(X_m = i) = \pi_i$
for all $i$ for all $m > n$.

Suppose now that it is not the case that $P(X_n = i) = \pi_i$ for all $i$. One might still
expect that, if the chain is run for a long time (i.e., $n \to \infty$), then the probability of
being at a particular state $i \in S$ might converge to $\pi_i$, regardless of the initial state
chosen. That is, one might expect that

$$\lim_{n \to \infty} P(X_n = i) = \pi_i, \qquad (11.2.1)$$

for each $i \in S$, regardless of the initial distribution $\{\mu_i\}$.

This is not true in complete generality, as the following two examples show. However,
we shall see in Theorem 11.2.8 that this is indeed true for most Markov chains.


EXAMPLE 11.2.20
Suppose that $S = \{1, 2\}$ and that the transition probabilities are given by

$$(p_{ij}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

That is, this Markov chain never moves at all! Suppose also that $\mu_1 = 1$, i.e., that we
always have $X_0 = 1$.

In this case, any distribution is stationary for this chain. In particular, we can take
$\pi_1 = \pi_2 = 1/2$ as a stationary distribution. On the other hand, we clearly have
$P_1(X_n = 1) = 1$ for all $n$. Because $\pi_1 = 1/2$, and $1 \ne 1/2$, we do not have
$\lim_{n \to \infty} P(X_n = i) = \pi_i$ in this case.

We shall see later that this Markov chain is not “irreducible,” which is the obstacle
to convergence. ■


EXAMPLE 11.2.21 
Suppose again that $S = \{1, 2\}$, but that this time the transition probabilities are given
by

$$(p_{ij}) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

That is, this Markov chain always moves from 1 to 2, and from 2 to 1. Suppose again
that $\mu_1 = 1$, i.e., that we always have $X_0 = 1$.

We may again take $\pi_1 = \pi_2 = 1/2$ as a stationary distribution (in fact, this time
the stationary distribution is unique). On the other hand, this time we clearly have
$P_1(X_n = 1) = 1$ for $n$ even, and $P_1(X_n = 1) = 0$ for $n$ odd. Hence, again we do not
have $\lim_{n \to \infty} P_1(X_n = 1) = \pi_1 = 1/2$.

We shall see that here the obstacle to convergence is that the Markov chain is
“periodic,” with period 2. ■


In light of these examples, we make some definitions. 


Definition 11.2.3 A Markov chain is irreducible if it is possible for the chain to
move from any state to any other state. Equivalently, the Markov chain is irreducible
if for any $i, j \in S$, there is a positive integer $n$ with $P_i(X_n = j) > 0$.


Thus, the Markov chain of Example 11.2.20 is not irreducible because it is not
possible to get from state 1 to state 2. Indeed, in that case, $P_1(X_n = 2) = 0$ for all $n$.


EXAMPLE 11.2.22 
Consider the Markov chain with $S = \{1, 2, 3\}$, and

$$(p_{ij}) = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 1/4 & 1/4 \\ 1/2 & 1/4 & 1/4 \end{pmatrix}.$$

For this chain, it is not possible to get from state 1 to state 3 in one step. On the other
hand, it is possible to get from state 1 to state 2, and then from state 2 to state 3. Hence,
this chain is still irreducible. ■


EXAMPLE 11.2.23
Consider the Markov chain with $S = \{1, 2, 3\}$, and

$$(p_{ij}) = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 3/4 & 1/4 & 0 \\ 1/2 & 1/4 & 1/4 \end{pmatrix}.$$

For this chain, it is not possible to get from state 1 to state 3 in one step. Furthermore,
it is not possible to get from state 2 to state 3, either. In fact, there is no way to ever get
from state 1 to state 3, in any number of steps. Hence, this chain is not irreducible. ■
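For a finite chain, irreducibility depends only on which entries of the transition matrix are positive, so it can be tested by a graph search. The following sketch assumes a finite state space stored as a list of rows; the function name `is_irreducible` is illustrative. It is applied to the chains of Examples 11.2.22 and 11.2.23.

```python
def is_irreducible(P):
    """True if every state can be reached from every state along positive entries."""
    n = len(P)
    for start in range(n):
        seen = {start}
        stack = [start]
        while stack:  # depth-first search over one-step transitions with p_ij > 0
            i = stack.pop()
            for j in range(n):
                if P[i][j] > 0 and j not in seen:
                    seen.add(j)
                    stack.append(j)
        if len(seen) < n:
            return False
    return True

P_22 = [[1/2, 1/2, 0], [1/2, 1/4, 1/4], [1/2, 1/4, 1/4]]  # Example 11.2.22
P_23 = [[1/2, 1/2, 0], [3/4, 1/4, 0], [1/2, 1/4, 1/4]]    # Example 11.2.23
print(is_irreducible(P_22), is_irreducible(P_23))
```

In the second chain the search started from state 1 only ever visits states 1 and 2, which is exactly the obstruction described in the example.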


Clearly, if a Markov chain is not irreducible, then the Markov chain convergence 
(11.2.1) will not always hold, because it will be impossible to ever get to certain states 
of the chain. 

We also need the following definition. 


Definition 11.2.4 Given Markov chain transitions $\{p_{ij}\}$ on a state space $S$, and
a state $i \in S$, the period of $i$ is the greatest common divisor (g.c.d.) of the set
$\{n \ge 1 : p_{ii}^{(n)} > 0\}$, where $p_{ii}^{(n)} = P(X_n = i \mid X_0 = i)$.


That is, the period of $i$ is the g.c.d. of the times at which it is possible to travel from
$i$ to $i$. For example, the period of $i$ is 2 if it is only possible to travel from $i$ to $i$ in an
even number of steps. (Such was the case for Example 11.2.21.) On the other hand, if
$p_{ii} > 0$, then clearly the period of $i$ is 1.

Clearly, if the period of some state is greater than 1, then again (11.2.1) will not
always hold, because the chain will be able to reach certain states at certain times only.
This prompts the following definition.


Definition 11.2.5 A Markov chain is aperiodic if the period of each state is equal
to 1.


EXAMPLE 11.2.24 
Consider the Markov chain with $S = \{1, 2, 3\}$, and

$$(p_{ij}) = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.$$

For this chain, from state 1 it is possible only to get to state 2. And from state 2 it
is possible only to get to state 3. Then from state 3 it is possible only to get to state
1. Hence, it is possible only to return to state 1 after an integer multiple of 3 steps.
Hence, state 1 (and, indeed, all three states) has period equal to 3, and the chain is not
aperiodic. ■


EXAMPLE 11.2.25 
Consider the Markov chain with $S = \{1, 2, 3\}$, and

$$(p_{ij}) = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1/2 & 0 & 1/2 \end{pmatrix}.$$

For this chain, from state 1 it is possible only to get to state 2. And from state 2 it is
possible only to get to state 3. However, from state 3 it is possible to get to either state
1 or state 3. Hence, it is possible to return to state 1 after either 3 or 4 steps. Because
the g.c.d. of 3 and 4 is 1, we conclude that the period of state 1 (and, indeed, of all
three states) is equal to 1, and the chain is indeed aperiodic. ■
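The period of a state can be estimated by listing the possible return times and taking their g.c.d. The sketch below is a heuristic rather than a full algorithm: it only inspects return times up to a cutoff `max_steps` (an assumed parameter), which is ample for the small chains in these examples but could overstate the period of a large chain.

```python
from math import gcd

def period(P, i, max_steps=50):
    """g.c.d. of the n in 1..max_steps for which a return from i to i is possible."""
    n_states = len(P)
    reachable = {i}  # states reachable from i in exactly n steps
    g = 0            # gcd(0, n) == n, so 0 acts as "no return time seen yet"
    for n in range(1, max_steps + 1):
        reachable = {j for k in reachable for j in range(n_states) if P[k][j] > 0}
        if i in reachable:
            g = gcd(g, n)
    return g

P_24 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]      # Example 11.2.24
P_25 = [[0, 1, 0], [0, 0, 1], [1/2, 0, 1/2]]  # Example 11.2.25
print([period(P_24, i) for i in range(3)], [period(P_25, i) for i in range(3)])
```

For the second chain, state 1 can be revisited after 3 steps and after 4 steps, so the running g.c.d. drops to 1 as soon as both have been seen.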


We note the following simple fact. 


Theorem 11.2.7 If a Markov chain has $p_{ij} > 0$ for all $i, j \in S$, then the chain is
irreducible and aperiodic.


PROOF If $p_{ij} > 0$ for all $i, j \in S$, then $P_i(X_1 = j) > 0$ for all $i, j \in S$. Hence,
the Markov chain must be irreducible.

Also, if $p_{ij} > 0$ for all $i, j \in S$, then the set $\{n \ge 1 : p_{ii}^{(n)} > 0\}$ contains the value
$n = 1$ (and, indeed, all positive integers $n$). Hence, its greatest common divisor must
be 1. Therefore, each state $i$ has period 1, so the chain is aperiodic. ■


In terms of the preceding definitions, we have the following very important theorem 
about Markov chain convergence. 


Theorem 11.2.8 Suppose a Markov chain is irreducible and aperiodic and has a
stationary distribution $\{\pi_i\}$. Then regardless of the initial distribution $\{\mu_i\}$, we have
$\lim_{n \to \infty} P(X_n = i) = \pi_i$ for all states $i$.


PROOF For a proof of this, see more advanced probability books, e.g., pages 92-93
of A First Look at Rigorous Probability Theory, 2nd ed., by J. S. Rosenthal (World
Scientific Publishing, Singapore, 2006). ■


Theorem 11.2.8 shows that stationary distributions are even more important. Not 
only does a Markov chain remain in a stationary distribution once it is there, but for 
most chains (irreducible and aperiodic ones), the probabilities converge to the station- 
ary distribution in any case. Hence, the stationary distribution provides fundamental 
information about the long-term behavior of the Markov chain. 


EXAMPLE 11.2.26 
Consider again the Markov chain with $S = \{1, 2, 3\}$, and

$$(p_{ij}) = \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/2 & 1/4 & 1/4 \\ 1/2 & 1/4 & 1/4 \end{pmatrix}.$$

We have already seen that if we set $\pi_1 = 1/2$, $\pi_2 = 1/4$, and $\pi_3 = 1/4$, then $\{\pi_i\}$
is a stationary distribution. Furthermore, we see that $p_{ij} > 0$ for all $i, j \in S$, so by
Theorem 11.2.7 the Markov chain must be irreducible and aperiodic.

We conclude that $\lim_{n \to \infty} P(X_n = i) = \pi_i$ for all states $i$. For example,
$\lim_{n \to \infty} P(X_n = 1) = 1/2$. (Also, this limit does not depend on the initial distribution, so, for
example, $\lim_{n \to \infty} P_1(X_n = 1) = 1/2$ and $\lim_{n \to \infty} P_2(X_n = 1) = 1/2$, as well.)

In fact, for this example we will have $P(X_n = i) = \pi_i$ for all $i$ provided $n \ge 1$. ■


EXAMPLE 11.2.27
Consider again the Markov chain of Example 11.2.14, with $S = \{0, 1\}$ and

$$(p_{ij}) = \begin{pmatrix} 0.1 & 0.9 \\ 0.6 & 0.4 \end{pmatrix}.$$

We have already seen that this Markov chain has a stationary distribution, given by
$\pi_0 = 2/5$ and $\pi_1 = 3/5$.

Furthermore, because $p_{ij} > 0$ for all $i, j \in S$, this Markov chain is irreducible
and aperiodic. Therefore, we conclude that $\lim_{n \to \infty} P(X_n = i) = \pi_i$. So, if (say)
$n = 100$, then we will have $P(X_{100} = 0) \approx 2/5$, and $P(X_{100} = 1) \approx 3/5$. Once again,
this conclusion does not depend on the initial distribution, so, e.g.,
$\lim_{n \to \infty} P_0(X_n = i) = \lim_{n \to \infty} P_1(X_n = i) = \pi_i$ as well. ■


EXAMPLE 11.2.28 
Consider again the Markov chain of Example 11.2.16, with $S = \{1, 2, 3\}$, and

$$(p_{ij}) = \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/3 & 1/3 & 1/3 \\ 0 & 1/4 & 3/4 \end{pmatrix}.$$

We have already seen that this chain has a stationary distribution $\{\pi_i\}$ given by
$\pi_1 = 2/11$, $\pi_2 = 3/11$, and $\pi_3 = 6/11$.

Now, in this case, we do not have $p_{ij} > 0$ for all $i, j \in S$ because $p_{31} = 0$. On the
other hand, $p_{32} > 0$ and $p_{21} > 0$, so by Theorem 11.2.3, we have

$$P_3(X_2 = 1) = \sum_{k \in S} p_{3k} p_{k1} = p_{32} p_{21} > 0.$$

Hence, the chain is still irreducible.

Similarly, we have $P_3(X_2 = 3) \ge p_{32} p_{23} > 0$, and $P_3(X_3 = 3) \ge p_{32} p_{21} p_{13} > 0$.
Therefore, because the g.c.d. of 2 and 3 is 1, we see that the g.c.d. of the set of $n$ with
$P_3(X_n = 3) > 0$ is also 1. Hence, the chain is still aperiodic.

Because the chain is irreducible and aperiodic, it follows from Theorem 11.2.8 that
$\lim_{n \to \infty} P(X_n = i) = \pi_i$, for all states $i$. Hence, $\lim_{n \to \infty} P(X_n = 1) = 2/11$,
$\lim_{n \to \infty} P(X_n = 2) = 3/11$, and $\lim_{n \to \infty} P(X_n = 3) = 6/11$. Thus, if (say) $n = 500$,
then we expect that $P(X_{500} = 1) \approx 2/11$, $P(X_{500} = 2) \approx 3/11$, and
$P(X_{500} = 3) \approx 6/11$. ■
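The convergence asserted by Theorem 11.2.8 can be observed numerically by propagating an initial distribution through the chain. The sketch below uses floating-point arithmetic; the helper name `distribution_after` is illustrative, and states 1, 2, 3 are stored at indices 0, 1, 2. It starts the chain of this example from each state in turn.

```python
def distribution_after(v0, A, n):
    """Row vector of P(X_n = i), obtained by applying v -> v A a total of n times."""
    size = len(A)
    v = list(v0)
    for _ in range(n):
        v = [sum(v[k] * A[k][i] for k in range(size)) for i in range(size)]
    return v

A = [[1/2, 1/4, 1/4],
     [1/3, 1/3, 1/3],
     [0, 1/4, 3/4]]
target = [2/11, 3/11, 6/11]  # stationary distribution from Example 11.2.16

results = []
for start in range(3):
    v0 = [1.0 if i == start else 0.0 for i in range(3)]  # chain started at one state
    results.append(distribution_after(v0, A, 500))
for row in results:
    print(row)
```

All three starting points give numerically the same vector after 500 steps, illustrating that the limit does not depend on the initial distribution.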


Summary of Section 11.2 


• A Markov chain is a sequence $\{X_n\}$ of random variables, having transition probabilities
$\{p_{ij}\}$ such that $P(X_{n+1} = j \mid X_n = i) = p_{ij}$, and having an initial
distribution $\{\mu_i\}$ such that $P(X_0 = i) = \mu_i$.

• There are many different examples of Markov chains.

• All probabilities for all the $X_n$ can be computed in terms of $\{\mu_i\}$ and $\{p_{ij}\}$.

• A distribution $\{\pi_i\}$ is stationary for the chain if $\sum_{i \in S} \pi_i p_{ij} = \pi_j$ for all $j \in S$.

• If the Markov chain is irreducible and aperiodic, and $\{\pi_i\}$ is stationary, then
$\lim_{n \to \infty} P(X_n = i) = \pi_i$ for all $i \in S$.


EXERCISES 


11.2.1 Consider a Markov chain with $S = \{1, 2, 3\}$, $\mu_1 = 0.7$, $\mu_2 = 0.1$, $\mu_3 = 0.2$,
and

$$(p_{ij}) = \begin{pmatrix} 1/4 & 1/4 & 1/2 \\ 1/6 & 1/2 & 1/3 \\ 1/8 & 3/8 & 1/2 \end{pmatrix}.$$

Compute the following quantities.
(a) $P(X_0 = 1)$
(b) $P(X_0 = 2)$
(c) $P(X_0 = 3)$
(d) $P(X_1 = 2 \mid X_0 = 1)$
(e) $P(X_3 = 2 \mid X_2 = 1)$

(f) P(X, =2| Xo = 2) 

(g) P(X = 2). 

11.2.2 Consider a Markov chain with $S = \{\text{high}, \text{low}\}$, $\mu_{\text{high}} = 1/3$, $\mu_{\text{low}} = 2/3$, and


1/4 3/4 


Compute the following quantities. 

(a) P(Xo = high) 

(b) P (Xo = low) 

(c) P(X, = high | Xo = high) 

(d) P(X3 = high | X2 = low) 

(e) P(X) = high) 

11.2.3 Consider a Markov chain with $S = \{0, 1\}$, and

$$(p_{ij}) = \begin{pmatrix} 0.2 & 0.8 \\ 0.3 & 0.7 \end{pmatrix}.$$

(a) Compute $P_i(X_2 = j)$ for all four combinations of $i, j \in S$.
(b) Compute $P_0(X_3 = 1)$.

11.2.4 Consider again the Markov chain with $S = \{0, 1\}$ and

$$(p_{ij}) = \begin{pmatrix} 0.2 & 0.8 \\ 0.3 & 0.7 \end{pmatrix}.$$

(a) Compute a stationary distribution $\{\pi_i\}$ for this chain.
(b) Compute $\lim_{n \to \infty} P_0(X_n = 0)$.
(c) Compute $\lim_{n \to \infty} P_1(X_n = 0)$.


11.2.5 Consider the Markov chain of Example 11.2.5, with $S = \{1, 2, 3, 4, 5, 6, 7\}$
and

$$(p_{ij}) = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/2 & 0 & 1/2 & 0 & 0 & 0 & 0 \\
0 & 1/5 & 4/5 & 0 & 0 & 0 & 0 \\
0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 \\
1/10 & 0 & 0 & 0 & 7/10 & 0 & 1/5 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}.$$

Compute the following quantities. 

(a) P(X = 1) 

(b) Po(X1 = 2) 

(c) P(X = 3) 

(d) Po(X2 = 1) 

(e) P2(X2 = 2) 

(f) Po(X2 = 3) 

(g) Po(X3 = 3) 

h) P(X% = 1) 

(i) PRX =7) 

0) Po(X2 =7) 

(k) P(X =7) 

(1) max, P2 (Xn = 7) (i.e., the largest probability of going from state 2 to state 7 in n 
steps, for any n) 

(m) Is this Markov chain irreducible? 

11.2.6 For each of the following transition probability matrices, determine (with ex- 
planation) whether it is irreducible, and whether it is aperiodic. 


” 0.2 0.8 
= (93 a 
(b)
$$(p_{ij}) = \begin{pmatrix} 1/4 & 1/4 & 1/2 \\ 1/6 & 1/2 & 1/3 \\ 1/8 & 3/8 & 1/2 \end{pmatrix}$$
(c) we 
)=(93 a 
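(d)
$$(p_{ij}) = \begin{pmatrix} 0 & 1 & 0 \\ 1/3 & 1/3 & 1/3 \\ 0 & 1 & 0 \end{pmatrix}$$
(e)
$$(p_{ij}) = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}$$
(f)
$$(p_{ij}) = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1/2 & 0 & 1/2 \end{pmatrix}$$
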
11.2.7 Compute a stationary distribution for the Markov chain of Example 11.2.4. 
(Hint: Do not forget Example 11.2.15.) 

11.2.8 Show that the random walk on the circle process (see Example 11.2.6) is 
(a) irreducible. 

(b) aperiodic. 

(c) reversible with respect to its stationary distribution. 

11.2.9 Show that the Ehrenfest’s Urn process (see Example 11.2.7) is 

(a) irreducible. 

(b) not aperiodic. 

(c) reversible with respect to its stationary distribution. 

11.2.10 Consider the Markov chain with $S = \{1, 2, 3\}$, and

$$(p_{ij}) = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1/2 & 1/2 & 0 \end{pmatrix}.$$

(a) Determine (with explanation) whether or not the chain is irreducible.
(b) Determine (with explanation) whether or not the chain is aperiodic.
(c) Compute a stationary distribution for the chain.
(d) Compute (with explanation) a good approximation to $P_1(X_{500} = 2)$.
11.2.11 Repeat all four parts of Exercise 11.2.10 if $S = \{1, 2, 3\}$ and

$$(p_{ij}) = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 0 & 0 & 1 \\ 1/2 & 1/2 & 0 \end{pmatrix}.$$

11.2.12 Consider a Markov chain with $S = \{1, 2, 3\}$ and

$$(p_{ij}) = \begin{pmatrix} 0.3 & 0.3 & 0.4 \\ 0.2 & 0.2 & 0.6 \\ 0.1 & 0.2 & 0.7 \end{pmatrix}.$$

(a) Is this Markov chain irreducible and aperiodic? Explain. (Hint: Do not forget
Theorem 11.2.7.)
(b) Compute $P_1(X_1 = 3)$.
(c) Compute $P_1(X_2 = 3)$.
(d) Compute $P_1(X_3 = 3)$.
(e) Compute $\lim_{n \to \infty} P_1(X_n = 3)$. (Hint: find a stationary distribution for the chain.)
11.2.13 For the Markov chain of the previous exercise, compute P| (X1 + X2 > 5). 
11.2.14 Consider a Markov chain with S = {1, 2, 3} and 


1 
0 1 




(a) Compute the period of each state. 
(b) Is this Markov chain aperiodic? Explain. 


11.2.15 Consider a Markov chain with S = {1, 2, 3} and 


0 1 0 
(pij) = 0.5 0 0.5 
0 1 0 


(a) Is this Markov chain irreducible? Explain. 
(b) Is this Markov chain aperiodic? Explain. 


PROBLEMS 


11.2.16 Consider a Markov chain with $S = \{1, 2, 3, 4, 5\}$, and

$$(p_{ij}) = \begin{pmatrix} 1/5 & 4/5 & 0 & 0 & 0 \\ 1/5 & 0 & 4/5 & 0 & 0 \\ 0 & 1/5 & 0 & 4/5 & 0 \\ 0 & 0 & 1/5 & 0 & 4/5 \\ 0 & 0 & 0 & 1/5 & 4/5 \end{pmatrix}.$$

Compute a stationary distribution $\{\pi_i\}$ for this chain. (Hint: Use reversibility, as in
Example 11.2.19.)

11.2.17 Suppose 100 lily pads are arranged in a circle, numbered 0,1,...,99 (with 
pad 99 next to pad 0). Suppose a frog begins at pad 0 and each second either jumps one 
pad clockwise, or jumps one pad counterclockwise, or stays where it is — each with 
probability 1/3. After doing this for a month, what is the approximate probability that 
the frog will be at pad 55? (Hint: The frog is doing random walk on the circle, as in 
Example 11.2.6. Also, the results of Example 11.2.17 and Theorem 11.2.8 may help.) 
11.2.18 Prove Theorem 11.2.4. (Hint: Proceed as in the proof of Theorem 11.2.3, and 
use induction.) 

11.2.19 In Example 11.2.18, prove that $\sum_{i \in S} \pi_i p_{ij} = \pi_j$ when $j = 0$ and when
$j = d$.


DISCUSSION TOPICS 


11.2.20 With a group, create the “human Markov chain” of Example 11.2.9. Make as 
many observations as you can about the long-term behavior of the resulting Markov 
chain. 


11.3 | Markov Chain Monte Carlo 


In Section 4.5, we saw that it is possible to estimate various quantities (such as prop- 
erties of real objects through experimentation, or the value of complicated sums or 
integrals) by using Monte Carlo techniques, namely, by generating appropriate random 


642 Section 11.3: Markov Chain Monte Carlo 


variables on a computer. Furthermore, we have seen in Section 2.10 that it is quite easy 
to generate random variables having certain special distributions. The Monte Carlo 
method was used several times in Chapters 6, 7, 9, and 10 to assist in the implementa- 
tion of various statistical methods. 

However, for many (in fact, most!) probability distributions, there is no simple, 
direct way to simulate (on a computer) random variables having such a distribution. 
We illustrate this with an example. 


EXAMPLE 11.3.1 
Let $Z$ be a random variable taking values on the set of all integers, with

$$P(Z = j) = C (j - 1/2)^4 e^{-3|j|} \cos^2(j) \qquad (11.3.1)$$

for $j = \ldots, -2, -1, 0, 1, 2, 3, \ldots$, where $C = 1 / \sum_{j=-\infty}^{\infty} (j - 1/2)^4 e^{-3|j|} \cos^2(j)$.
Now suppose that we want to compute the quantity $A = E((Z - 20)^2)$.

Well, if we could generate i.i.d. random variables $Y_1, Y_2, \ldots, Y_M$ with distribution
given by (11.3.1), for very large $M$, then we could estimate $A$ by

$$A \approx \hat{A} = \frac{1}{M} \sum_{i=1}^{M} (Y_i - 20)^2.$$

Then $\hat{A}$ would be a Monte Carlo estimate of $A$.

The problem, of course, is that it is not easy to generate random variables $Y_i$ with
this distribution. In fact, it is not even easy to compute the value of $C$. ■


Surprisingly, the difficulties described in Example 11.3.1 can sometimes be solved 
using Markov chains. We illustrate this idea as follows. 


EXAMPLE 11.3.2 
In the context of Example 11.3.1, suppose we could find a Markov chain on the state
space $S = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$ of all integers, which was irreducible and aperiodic
and which had a stationary distribution given by $\pi_j = C (j - 1/2)^4 e^{-3|j|} \cos^2(j)$
for $j \in S$.

If we did, then we could run the Markov chain for a long time $N$, to get random
values $X_0, X_1, X_2, \ldots, X_N$. For large enough $N$, by Theorem 11.2.8, we would have

$$P(X_N = j) \approx \pi_j = C (j - 1/2)^4 e^{-3|j|} \cos^2(j).$$

Hence, if we set $Y_1 = X_N$, then we would have $P(Y_1 = j)$ approximately equal to
(11.3.1), for all integers $j$. That is, the value of $X_N$ would be approximately as good
as a true random variable $Y_1$ with this distribution.

Once the value of $Y_1$ was generated, then we could repeat the process by again
running the Markov chain, this time to generate new random values

$$X_0^{[2]}, X_1^{[2]}, X_2^{[2]}, \ldots, X_N^{[2]}$$

(say). We would then have

$$P(X_N^{[2]} = j) \approx \pi_j = C (j - 1/2)^4 e^{-3|j|} \cos^2(j).$$

Hence, if we set $Y_2 = X_N^{[2]}$, then we would have $P(Y_2 = j)$ approximately equal to
(11.3.1), for all integers $j$.

Continuing in this way, we could generate values $Y_1, Y_2, Y_3, \ldots, Y_M$, such that
these are approximately i.i.d. from the distribution given by (11.3.1). We could then,
as before, estimate $A$ by

$$A \approx \hat{A} = \frac{1}{M} \sum_{i=1}^{M} (Y_i - 20)^2.$$

This time, the approximation has two sources of error. First, there is Monte Carlo
error because $M$ might not be large enough. Second, there is Markov chain error,
because $N$ might not be large enough. However, if $M$ and $N$ are both very large, then
$\hat{A}$ will be a good approximation to $A$. ■


We summarize the method of the preceding example in the following theorem. 


Theorem 11.3.1 (The Markov chain Monte Carlo method) Suppose we wish to
estimate the expected value $A = E(h(Z))$, where $P(Z = j) = \pi_j$ for $j \in S$, with
$P(Z = j) = 0$ for $j \notin S$. Suppose for $i = 1, 2, \ldots, M$, we can generate values
$X_0^{[i]}, X_1^{[i]}, \ldots, X_N^{[i]}$ from some Markov chain that is irreducible, aperiodic, and
has $\{\pi_j\}$ as a stationary distribution. Let

$$\hat{A} = \frac{1}{M} \sum_{i=1}^{M} h\left(X_N^{[i]}\right).$$

If $M$ and $N$ are sufficiently large, then $\hat{A} \approx A$.


It is somewhat inefficient to run M different Markov chains. Instead, practitioners 
often just run a single Markov chain, and average over the different values of the chain. 
For an irreducible Markov chain run long enough, this will again converge to the right 
answer, as the following theorem states. 


Theorem 11.3.2 (The single-chain Markov chain Monte Carlo method) Suppose
we wish to estimate the expected value $A = E(h(Z))$, where $P(Z = j) = \pi_j$
for $j \in S$, with $P(Z = j) = 0$ for $j \notin S$. Suppose we can generate values
$X_0, X_1, X_2, \ldots, X_N$ from some Markov chain that is irreducible, aperiodic, and
has $\{\pi_j\}$ as a stationary distribution. For some integer $B \ge 0$, let

$$\hat{A} = \frac{1}{N - B} \sum_{i=B+1}^{N} h(X_i).$$

If $N - B$ is sufficiently large, then $\hat{A} \approx A$.


Here, $B$ is the burn-in time, designed to remove the influence of the chain's starting
value $X_0$. The best choice of $B$ remains controversial among statisticians. However, if
the starting value $X_0$ is "reasonable," then it is okay to take $B = 0$, provided that $N$ is
sufficiently large. This is what was done, for instance, in Example 7.3.2.


644 Section 11.3: Markov Chain Monte Carlo 


These theorems indicate that, if we can construct a Markov chain that has $\{\pi_i\}$
as a stationary distribution, then we can use that Markov chain to estimate quantities
associated with $\{\pi_i\}$. This is a very helpful trick, and it has made the Markov chain
Monte Carlo method into one of the most popular techniques in the entire subject of
computational statistics.

However, for this technique to be useful, we need to be able to construct a Markov
chain that has $\{\pi_i\}$ as a stationary distribution. This sounds like a difficult problem!
Indeed, if $\{\pi_i\}$ were very simple, then we would not need to use Markov chain Monte
Carlo at all. But if $\{\pi_i\}$ is complicated, then how can we possibly construct a Markov
chain that has that particular stationary distribution?

Remarkably, this problem turns out to be much easier to solve than one might
expect. We now discuss one of the best solutions, the Metropolis–Hastings algorithm.


11.3.1 | The Metropolis–Hastings Algorithm


Suppose we are given a probability distribution $\{\pi_i\}$ on a state space $S$. How can we
construct a Markov chain on $S$ that has $\{\pi_i\}$ as a stationary distribution?

One answer is given by the Metropolis–Hastings algorithm. It designs a Markov
chain that proceeds in two stages. In the first stage, a new point is proposed from
some proposal distribution. In the second stage, the proposed point is either accepted
or rejected. If the proposed point is accepted, then the Markov chain moves there. If
it is rejected, then the Markov chain stays where it is. By choosing the probability
of accepting to be just right, we end up creating a Markov chain that has $\{\pi_i\}$ as a
stationary distribution.

The details of the algorithm are as follows. We start with a state space $S$, and
a probability distribution $\{\pi_i\}$ on $S$. We then choose some (simple) Markov chain
transition probabilities $\{q_{ij} : i, j \in S\}$ called the proposal distribution. Thus, we
require that $q_{ij} \ge 0$, and $\sum_{j \in S} q_{ij} = 1$ for each $i \in S$. However, we do not assume
that $\{\pi_i\}$ is a stationary distribution for the chain $\{q_{ij}\}$; indeed, the chain $\{q_{ij}\}$ might
not even have a stationary distribution.

Given $X_n = i$, the Metropolis–Hastings algorithm computes the value $X_{n+1}$ as
follows.

1. Choose $Y_{n+1} = j$ according to the Markov chain $\{q_{ij}\}$.

2. Set $\alpha_{ij} = \min\left\{1, \dfrac{\pi_j q_{ji}}{\pi_i q_{ij}}\right\}$ (the acceptance probability).

3. With probability $\alpha_{ij}$, let $X_{n+1} = Y_{n+1} = j$ (i.e., accepting the proposal $Y_{n+1}$).
Otherwise, with probability $1 - \alpha_{ij}$, let $X_{n+1} = X_n = i$ (i.e., rejecting the
proposal $Y_{n+1}$).
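
The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original text: the function name `mh_step`, the callables `pi`, `q`, and `propose`, and the three-state toy target with weights 1, 2, 3 are all assumptions chosen for the demonstration. Note that `pi` only needs to be known up to a constant, since the constant cancels in the ratio.

```python
import random

def mh_step(i, pi, q, propose, rng):
    """Perform one Metropolis-Hastings update from the current state i."""
    j = propose(i, rng)                            # step 1: draw the proposal j from q_{ij}
    ratio = (pi(j) * q(j, i)) / (pi(i) * q(i, j))
    alpha = min(1.0, ratio)                        # step 2: acceptance probability
    return j if rng.random() < alpha else i        # step 3: accept, or stay at i

# Hypothetical toy target on S = {0, 1, 2}, specified only up to a constant.
weights = {0: 1.0, 1: 2.0, 2: 3.0}
states = list(weights)

def pi(j):
    return weights[j]

def q(i, j):
    return 1.0 / len(states)        # uniform proposal over S (an assumption)

def propose(i, rng):
    return rng.choice(states)

rng = random.Random(0)
n_steps = 200_000
x = 0
counts = {s: 0 for s in states}
for _ in range(n_steps):
    x = mh_step(x, pi, q, propose, rng)
    counts[x] += 1

# Long-run frequencies; these should be near 1/6, 2/6, and 3/6.
freqs = {s: counts[s] / n_steps for s in states}
print(freqs)
```

The long-run visit frequencies approximate the normalized weights even though the normalizing constant was never computed.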


The reason for this unusual algorithm is given by the following theorem. 


Theorem 11.3.3 The preceding Metropolis–Hastings algorithm results in a Markov
chain $X_0, X_1, X_2, \ldots$, which has $\{\pi_i\}$ as a stationary distribution.




PROOF | See Section 11.7 for the proof. ■


We consider some applications of this algorithm. 


EXAMPLE 11.3.3 
As in Example 11.3.1, suppose $S = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$, and

$$\pi_j = C (j - 1/2)^4 e^{-3|j|} \cos^2(j),$$

for $j \in S$. We shall construct a Markov chain having $\{\pi_i\}$ as a stationary distribution.
We first need to choose some simple Markov chain $\{q_{ij}\}$. We let $\{q_{ij}\}$ be simple
random walk with $p = 1/2$, so that $q_{ij} = 1/2$ if $j = i + 1$ or $j = i - 1$, and $q_{ij} = 0$
otherwise.
We then compute that if $j = i + 1$ or $j = i - 1$, then

$$\alpha_{ij} = \min\left\{1, \frac{q_{ji}\pi_j}{q_{ij}\pi_i}\right\}
= \min\left\{1, \frac{(1/2)\, C (j - 1/2)^4 e^{-3|j|} \cos^2(j)}{(1/2)\, C (i - 1/2)^4 e^{-3|i|} \cos^2(i)}\right\}
= \min\left\{1, \frac{(j - 1/2)^4 e^{-3|j|} \cos^2(j)}{(i - 1/2)^4 e^{-3|i|} \cos^2(i)}\right\}. \qquad (11.3.2)$$

Note that $C$ has cancelled out, so that $\alpha_{ij}$ does not depend on $C$. (In fact, this will
always be the case.) Hence, we see that $\alpha_{ij}$, while somewhat messy, is still very easy
for a computer to calculate.
Given $X_n = i$, the Metropolis–Hastings algorithm computes the value $X_{n+1}$ as
follows.

1. Let $Y_{n+1} = X_n + 1$ or $Y_{n+1} = X_n - 1$, with probability 1/2 each.

2. Let $j = Y_{n+1}$, and compute $\alpha_{ij}$ as in (11.3.2).

3. With probability $\alpha_{ij}$, let $X_{n+1} = Y_{n+1} = j$. Otherwise, with probability $1 - \alpha_{ij}$,
let $X_{n+1} = X_n = i$.

These steps can all be easily performed on a computer. If we repeat this for
$n = 0, 1, 2, \ldots, N - 1$, for some large number $N$ of iterations, then we will obtain a random
variable $X_N$, where $P(X_N = j) \approx \pi_j = C (j - 1/2)^4 e^{-3|j|} \cos^2(j)$, for all $j \in S$. ■
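
As a rough illustration of how the steps of this example might look on a computer, here is a minimal Python sketch. The function names (`weight`, `alpha`, `run_chain`, `single_chain_estimate`), the starting value 0, the run length, and the burn-in of 1,000 are all assumptions made for the demonstration. The last function averages over a single chain after a burn-in $B$, in the spirit of Theorem 11.3.2.

```python
import math
import random

def weight(j):
    """pi_j from (11.3.1) without the unknown constant C."""
    return (j - 0.5) ** 4 * math.exp(-3 * abs(j)) * math.cos(j) ** 2

def alpha(i, j):
    """Acceptance probability (11.3.2); the constant C has cancelled."""
    return min(1.0, weight(j) / weight(i))

def run_chain(n_steps, x0=0, seed=0):
    """Return the path X_0, X_1, ..., X_N of the Metropolis-Hastings chain."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(n_steps):
        y = x + rng.choice((-1, 1))       # step 1: simple random walk proposal
        if rng.random() < alpha(x, y):    # steps 2 and 3: accept with prob. alpha
            x = y
        path.append(x)
    return path

def single_chain_estimate(h, path, burn_in=0):
    """Average h over X_{B+1}, ..., X_N, discarding the first B steps."""
    kept = path[burn_in + 1:]
    return sum(h(x) for x in kept) / len(kept)

path = run_chain(200_000)
# Estimate P(Z = -1) by taking h to be the indicator of the state -1.
est = single_chain_estimate(lambda x: x == -1, path, burn_in=1_000)
print(est)
```

Because the weights decay very quickly, the estimate can be compared against a direct numerical normalization of `weight` over a moderate range of integers, which is how one might sanity-check such a run.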
EXAMPLE 11.3.4 
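Again, let $S = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$, and this time let $\pi_j = K e^{-j^4}$, for $j \in S$.
Let the proposal distribution $\{q_{ij}\}$ correspond to a simple random walk with $p = 1/4$,
so that $Y_{n+1} = X_n + 1$ with probability 1/4, and $Y_{n+1} = X_n - 1$ with probability 3/4.

In this case, we compute that if $j = i + 1$, then

$$\alpha_{ij} = \min\left\{1, \frac{q_{ji}\pi_j}{q_{ij}\pi_i}\right\}
= \min\left\{1, \frac{(3/4) K e^{-j^4}}{(1/4) K e^{-i^4}}\right\}
= \min\left\{1, 3 e^{i^4 - j^4}\right\}. \qquad (11.3.3)$$

If instead $j = i - 1$, then

$$\alpha_{ij} = \min\left\{1, \frac{q_{ji}\pi_j}{q_{ij}\pi_i}\right\}
= \min\left\{1, \frac{(1/4) K e^{-j^4}}{(3/4) K e^{-i^4}}\right\}
= \min\left\{1, (1/3) e^{i^4 - j^4}\right\}. \qquad (11.3.4)$$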



(Note that the constant $K$ has again cancelled out, as expected.) Hence, again $\alpha_{ij}$ is
very easy for a computer to calculate.

Given $X_n = i$, the Metropolis–Hastings algorithm computes the value $X_{n+1}$ as
follows.

1. Let $Y_{n+1} = X_n + 1$ with probability 1/4, or $Y_{n+1} = X_n - 1$ with probability
3/4.

2. Let $j = Y_{n+1}$, and compute $\alpha_{ij}$ using (11.3.3) and (11.3.4).

3. With probability $\alpha_{ij}$, let $X_{n+1} = Y_{n+1} = j$. Otherwise, with probability $1 - \alpha_{ij}$,
let $X_{n+1} = X_n = i$.

Once again, these steps can all be easily performed on a computer; if repeated for
some large number $N$ of iterations, then $P(X_N = j) \approx \pi_j = K e^{-j^4}$ for $j \in S$. ■
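
With an asymmetric proposal, the factors 3 and 1/3 in the acceptance probabilities matter. The following Python sketch, which assumes $\pi_j = K e^{-j^4}$ as in this example, codes (11.3.3) and (11.3.4) and checks numerically that the resulting chain satisfies the balance relation $\pi_i q_{ij} \alpha_{ij} = \pi_j q_{ji} \alpha_{ji}$ between neighbouring states. The function names are hypothetical, and working on the log scale is a design choice made here to avoid floating-point overflow, not something taken from the text.

```python
import math

def weight(j):
    """pi_j = K e^{-j^4} without the constant K."""
    return math.exp(-j ** 4)

def q(i, j):
    """Proposal probabilities of simple random walk with p = 1/4."""
    if j == i + 1:
        return 0.25
    if j == i - 1:
        return 0.75
    return 0.0

def alpha(i, j):
    """Acceptance probability from (11.3.3) and (11.3.4), on the log scale."""
    d = i ** 4 - j ** 4
    if j == i + 1:
        log_ratio = math.log(3.0) + d      # log of 3 e^{i^4 - j^4}
    elif j == i - 1:
        log_ratio = d - math.log(3.0)      # log of (1/3) e^{i^4 - j^4}
    else:
        raise ValueError("only neighbouring states are ever proposed")
    return math.exp(min(0.0, log_ratio))   # equals min{1, ratio}

# Check the balance relation between each pair of neighbours near 0.
for i in range(-3, 3):
    j = i + 1
    flow_ij = weight(i) * q(i, j) * alpha(i, j)
    flow_ji = weight(j) * q(j, i) * alpha(j, i)
    print(i, j, math.isclose(flow_ij, flow_ji, rel_tol=1e-9))
```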

The Metropolis–Hastings algorithm can also be used for continuous random variables
by using densities, as follows.
EXAMPLE 11.3.5 


Suppose we want to generate a sample from the distribution with density proportional
to

$$f(y) = e^{-y^4} (1 + |y|)^3.$$

So the density is $C f(y)$, where $C = 1 / \int_{-\infty}^{\infty} f(y)\, dy$. How can we generate a random
variable $Y$ such that $Y$ has approximately this distribution, i.e., has probability density
approximately equal to $C f(y)$?

Let us use a proposal distribution given by an $N(x, 1)$ distribution, namely, a normal
distribution with mean $x$ and variance 1. That is, given $X_n = x$, we choose $Y_{n+1}$
by $Y_{n+1} \sim N(x, 1)$. Because the $N(x, 1)$ distribution has density $(2\pi)^{-1/2} e^{-(y-x)^2/2}$,
this corresponds to a proposal density of $q(x, y) = (2\pi)^{-1/2} e^{-(y-x)^2/2}$.
As for the acceptance probability $\alpha(x, y)$, we again use densities, so that

$$\alpha(x, y) = \min\left\{1, \frac{C f(y)\, q(y, x)}{C f(x)\, q(x, y)}\right\}
= \min\left\{1, \frac{C e^{-y^4} (1 + |y|)^3 (2\pi)^{-1/2} e^{-(x-y)^2/2}}{C e^{-x^4} (1 + |x|)^3 (2\pi)^{-1/2} e^{-(y-x)^2/2}}\right\}
= \min\left\{1, \left(\frac{1 + |y|}{1 + |x|}\right)^3 e^{-y^4 + x^4}\right\}. \qquad (11.3.5)$$

Given $X_n = x$, the Metropolis–Hastings algorithm computes the value $X_{n+1}$ as
follows.

1. Generate $Y_{n+1} \sim N(X_n, 1)$.

2. Let $y = Y_{n+1}$, and compute $\alpha(x, y)$, as before.

3. With probability $\alpha(x, y)$, let $X_{n+1} = Y_{n+1} = y$. Otherwise, with probability
$1 - \alpha(x, y)$, let $X_{n+1} = X_n = x$.




Once again, these steps can all be easily performed on a computer; if repeated for
some large number $N$ of iterations, then the random variable $X_N$ will approximately
have density given by $C f(y)$. ■
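
A continuous-state version can be sketched in the same way. The Python below assumes $f(y) = e^{-y^4}(1 + |y|)^3$ and the $N(x, 1)$ proposal of this example; the function names, the starting point 0, and the run length are illustrative assumptions. The acceptance probability is computed on the log scale, which is simply a numerically convenient way of evaluating (11.3.5).

```python
import math
import random

def f(y):
    """Unnormalized target density of Example 11.3.5."""
    return math.exp(-y ** 4) * (1 + abs(y)) ** 3

def accept_prob(x, y):
    """alpha(x, y) from (11.3.5), evaluated on the log scale."""
    log_ratio = 3 * (math.log1p(abs(y)) - math.log1p(abs(x))) + x ** 4 - y ** 4
    return math.exp(min(0.0, log_ratio))

def sample(n_steps, x0=0.0, seed=0):
    """Run the chain and return X_1, ..., X_N."""
    rng = random.Random(seed)
    x, xs = x0, []
    for _ in range(n_steps):
        y = rng.gauss(x, 1.0)                  # step 1: Y ~ N(x, 1)
        if rng.random() < accept_prob(x, y):   # steps 2 and 3
            x = y
        xs.append(x)
    return xs

xs = sample(50_000)
print(sum(xs) / len(xs))   # the target is symmetric, so this should be near 0
```

Since the symmetric normal proposal densities cancel, only the ratio $f(y)/f(x)$ is needed, and the constant $C$ is never computed.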


11.3.2 | The Gibbs Sampler 


In Section 7.3.3 we discussed the Gibbs sampler and its application in a Bayesian
statistics problem. As we will now demonstrate, the Gibbs sampler is a specialized
version of the Metropolis–Hastings algorithm, designed for multivariate distributions.
It chooses the proposal probabilities $q_{ij}$ just right so that we always have $\alpha_{ij} = 1$, i.e.,
so that no rejections are ever required.

Suppose that $S = \{\ldots, -2, -1, 0, 1, 2, \ldots\} \times \{\ldots, -2, -1, 0, 1, 2, \ldots\}$, i.e., $S$ is
the set of all ordered pairs of integers $i = (i_1, i_2)$. (Thus, $(2, 3) \in S$, and $(-6, 14) \in S$,
etc.) Suppose that some distribution $\{\pi_i\}$ is defined on $S$. Define a proposal distribution
$\{q^{(1)}_{ij}\}$ as follows.

Let $V(i) = \{j \in S : j_2 = i_2\}$. That is, $V(i)$ is the set of all states $j \in S$ such that $i$
and $j$ agree in their second coordinate. Thus, $V(i)$ is a vertical line in $S$, which passes
through the point $i$.

In terms of this definition of $V(i)$, define $q^{(1)}_{ij} = 0$ if $j \notin V(i)$, i.e., if $i$ and $j$ differ
in their second coordinate. If $j \in V(i)$, i.e., if $i$ and $j$ agree in their second coordinate,
then define

$$q^{(1)}_{ij} = \frac{\pi_j}{\sum_{k \in V(i)} \pi_k}.$$

One interpretation is that, if $X_n = i$, and $P(Y_{n+1} = j) = q^{(1)}_{ij}$ for $j \in S$, then the
distribution of $Y_{n+1}$ is the conditional distribution of $\{\pi_i\}$, conditional on knowing that
the second coordinate must be equal to $i_2$.

In terms of this choice of $q^{(1)}_{ij}$, what is $\alpha_{ij}$? Well, if $j \in V(i)$, then $i \in V(j)$, and
also $V(j) = V(i)$. Hence,

$$\alpha_{ij} = \min\left\{1, \frac{\pi_j q^{(1)}_{ji}}{\pi_i q^{(1)}_{ij}}\right\}
= \min\left\{1, \frac{\pi_j \left(\pi_i / \sum_{k \in V(j)} \pi_k\right)}{\pi_i \left(\pi_j / \sum_{l \in V(i)} \pi_l\right)}\right\}
= \min\left\{1, \frac{\pi_j \pi_i}{\pi_i \pi_j}\right\} = \min\{1, 1\} = 1.$$

That is, this algorithm accepts the proposal $Y_{n+1}$ with probability 1, and never rejects
at all!

Now, this algorithm by itself is not very useful because it proposes only states in
$V(i)$, so it never changes the value of the second coordinate at all. However, we can
similarly define a horizontal line through $i$ by $H(i) = \{j \in S : j_1 = i_1\}$, so that $H(i)$
is the set of all states $j$ such that $i$ and $j$ agree in their first coordinate. That is, $H(i)$ is
a horizontal line in $S$ that passes through the point $i$.




We can then define $q^{(2)}_{ij} = 0$ if $j \notin H(i)$ (i.e., if $i$ and $j$ differ in their first coordinate),
while if $j \in H(i)$ (i.e., if $i$ and $j$ agree in their first coordinate), then

$$q^{(2)}_{ij} = \frac{\pi_j}{\sum_{k \in H(i)} \pi_k}.$$

As before, we compute that for this proposal, we will always have $\alpha_{ij} = 1$, i.e., the
Metropolis–Hastings algorithm with this proposal will never reject.

The Gibbs sampler works by combining these two different Metropolis–Hastings
algorithms, by alternating between them. That is, given a value $X_n = i$, it produces a
value $X_{n+1}$ as follows.

1. Propose a value $Y_{n+1} \in V(i)$, according to the proposal distribution $\{q^{(1)}_{ij}\}$.

2. Always accept $Y_{n+1}$ and set $j = Y_{n+1}$, thus moving vertically.

3. Propose a value $Z_{n+1} \in H(j)$, according to the proposal distribution $\{q^{(2)}_{ij}\}$.

4. Always accept $Z_{n+1}$, thus moving horizontally.

5. Set $X_{n+1} = Z_{n+1}$.


In this way, the Gibbs sampler does a “zigzag” through the state space S, alternately 
moving in the vertical and in the horizontal direction. 
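
The five steps above might be sketched in Python as follows. To keep the example finite, it assumes a hypothetical target on a small $2 \times 2$ grid given by a table of unnormalized weights, rather than a distribution on all pairs of integers; the names `gibbs_step` and `weights` are illustrative only.

```python
import random

def gibbs_step(state, weights, rng):
    """One Gibbs update X_n -> X_{n+1} for a target proportional to weights[a][b]."""
    i1, i2 = state
    n1, n2 = len(weights), len(weights[0])
    # Steps 1-2: move within V(i), keeping the second coordinate i2 fixed and
    # drawing the first coordinate from its conditional distribution.
    column = [weights[a][i2] for a in range(n1)]
    j1 = rng.choices(range(n1), weights=column)[0]
    # Steps 3-4: move within H(j), keeping the first coordinate j1 fixed and
    # drawing the second coordinate from its conditional distribution.
    j2 = rng.choices(range(n2), weights=weights[j1])[0]
    # Step 5: the new state.
    return (j1, j2)

# Hypothetical unnormalized target on {0, 1} x {0, 1}.
weights = [[1.0, 2.0],
           [3.0, 4.0]]

rng = random.Random(0)
n = 100_000
state = (0, 0)
counts = {}
for _ in range(n):
    state = gibbs_step(state, weights, rng)
    counts[state] = counts.get(state, 0) + 1

for s in sorted(counts):
    print(s, counts[s] / n)   # should be near 0.1, 0.2, 0.3, 0.4
```

No acceptance test appears anywhere in the code, reflecting the fact that every proposal is accepted.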
In light of Theorem 11.3.3, we immediately obtain the following.


Theorem 11.3.4 The preceding Gibbs sampler algorithm results in a Markov chain
$X_0, X_1, X_2, \ldots$ that has $\{\pi_i\}$ as a stationary distribution.


The Gibbs sampler thus provides a particular way of implementing the
Metropolis–Hastings algorithm in multidimensional problems, which never rejects the proposed
values.


Summary of Section 11.3 


• In cases that are too complicated for ordinary Monte Carlo techniques, it is possible
to use Markov chain Monte Carlo techniques instead, by averaging values
arising from a Markov chain.

• The Metropolis–Hastings algorithm provides a simple way to create a Markov
chain with stationary distribution $\{\pi_i\}$. Given $X_n$, it generates a proposal $Y_{n+1}$
from a proposal distribution $\{q_{ij}\}$, and then either accepts this proposal (and sets
$X_{n+1} = Y_{n+1}$) with probability $\alpha_{ij}$, or rejects this proposal (and sets $X_{n+1} = X_n$)
with probability $1 - \alpha_{ij}$.

• Alternatively, the Gibbs sampler updates the coordinates one at a time from their
conditional distribution, such that we always have $\alpha_{ij} = 1$.




EXERCISES 


11.3.1 Suppose z; = Ce~“—3)" fori e S = {..., —2, —1, 0, 1,2, ...}, where C = 
1 Jpn e~ 4-13)" Describe in detail a Metropolis—Hastings algorithm for {z;}, 
which uses simple random walk with p = 1/2 for the proposals. 

11.3.2 Suppose $\pi_i = C (i + 6.5)^{-8}$ for $i \in S = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$, where
$C = 1 / \sum_{j=-\infty}^{\infty} (j + 6.5)^{-8}$. Describe in detail a Metropolis–Hastings algorithm for
$\{\pi_i\}$, which uses simple random walk with $p = 5/8$ for the proposals.

11.3.3 Suppose m; = Ke“ i-i? fori e S = {...,—-2,—-1,0,1,2,...}, where 
C= Sree et -i°-i* | Describe in detail a Metropolis—Hastings algorithm for 
{z;}, which uses simple random walk with p = 7/9 for the proposals. 

11.3.4 Suppose f(x) = e= xix? for x e RI. Let K = 1/ tee e= =t x8 dx, 
Describe in detail a Metropolis—Hastings algorithm for the distribution having density 
K f(x), which uses the proposal distribution N(x, 1), i.e., a normal distribution with 
mean x and variance 1. 

11.3.5 Let f(x) = e™™ -f -*" for x e R!, and let K = 1/ So e™ ==? dy, 
Describe in detail a Metropolis—Hastings algorithm for the distribution having density 
K f(x), which uses the proposal distribution N(x, 10), i.e., a normal distribution with 
mean x and variance 10. 


COMPUTER EXERCISES 


11.3.6 Run the algorithm of Exercise 11.3.1. Discuss the output. 


11.3.7 Run the algorithm of Exercise 11.3.2. Discuss the output. 


PROBLEMS 


11.3.8 Suppose $S = \{1, 2, 3, \ldots\} \times \{1, 2, 3, \ldots\}$, i.e., $S$ is the set of all pairs of positive
integers. For $i = (i_1, i_2) \in S$, suppose $\pi_i = C / 2^{i_1 + i_2}$ for appropriate positive constant
$C$. Describe in detail a Gibbs sampler algorithm for this distribution $\{\pi_i\}$.


COMPUTER PROBLEMS 


11.3.9 Run the algorithm of Exercise 11.3.4. Discuss the output. 
11.3.10 Run the algorithm of Exercise 11.3.5. Discuss the output. 


DISCUSSION TOPICS 


11.3.11 Why do you think Markov chain Monte Carlo algorithms have become so 
popular in so many branches of science? (List as many reasons as you can.) 


11.3.12 Suppose you will be using a Markov chain Monte Carlo estimate of the form

$$\hat{A} = \frac{1}{M} \sum_{i=1}^{M} h\left(X_N^{[i]}\right).$$


650 Section 11.4: Martingales 


Suppose also that, due to time constraints, your total number of iterations cannot be
more than one million. That is, you must have $NM \le 1{,}000{,}000$. Discuss the advantages
and disadvantages of the following choices of $N$ and $M$.

(a) $N = 1{,}000{,}000$, $M = 1$

(b) $N = 1$, $M = 1{,}000{,}000$

(c) $N = 100$, $M = 10{,}000$

(d) $N = 10{,}000$, $M = 100$

(e) $N = 1000$, $M = 1000$

(f) Which choice do you think would be best, under what circumstances? Why?


11.4 | Martingales 


In this section, we study a special class of stochastic processes called martingales. We 
shall see that these processes are characterized by “staying the same on average.” 

As motivation, consider again a simple random walk in the case of a fair game, i.e., 
with p = 1/2. Suppose, as in the gambler’s ruin setup, that you start at a and keep 
going until you hit either c or 0, where 0 < a < c. Let Z be the value that you end up 
with, so that we always have either Z = c or Z = 0. We know from Theorem 11.1.2 
that in fact P(Z = c) = a/c, so that P(Z = 0) = 1 — a/c. 

Let us now consider the expected value of Z. We have that 


$$E(Z) = \sum_{z \in R^1} z\, P(Z = z) = c\, P(Z = c) + 0 \cdot P(Z = 0) = c\,(a/c) = a.$$
That is, the average value of where you end up is a. But a is also the value at which 
you started! 


This is not a coincidence. Indeed, because p = 1/2 (i.e., the game was fair), this 
means that “on average” you always stayed at a. That is, {Xn} is a martingale. 


11.4.1 | Definition of a Martingale 


We begin with the definition of a martingale. For simplicity, we assume that the mar- 
tingale is a Markov chain, though this is not really necessary. 


Definition 11.4.1 Let $X_0, X_1, X_2, \ldots$ be a Markov chain. The chain is a martingale
if for all $n = 0, 1, 2, \ldots$, we have $E(X_{n+1} - X_n \mid X_n) = 0$. That is, on average the
chain's value does not change, regardless of what the current value $X_n$ actually is.


EXAMPLE 11.4.1 
Let $\{X_n\}$ be simple random walk with $p = 1/2$. Then $X_{n+1} - X_n$ is equal to either 1
or $-1$, with probability 1/2 each. Hence,

$$E(X_{n+1} - X_n \mid X_n) = (1)(1/2) + (-1)(1/2) = 0,$$

so $\{X_n\}$ stays the same on average and is a martingale. (Note that we will never actually
have $X_{n+1} - X_n = 0$. However, on average we will have $X_{n+1} - X_n = 0$.) ■




EXAMPLE 11.4.2 
Let $\{X_n\}$ be simple random walk with $p = 2/3$. Then $X_{n+1} - X_n$ is equal to either 1
or $-1$, with probabilities 2/3 and 1/3, respectively. Hence,

$$E(X_{n+1} - X_n \mid X_n) = (1)(2/3) + (-1)(1/3) = 1/3 \ne 0.$$

Thus, $\{X_n\}$ is not a martingale in this case. ■


EXAMPLE 11.4.3 

Suppose we start with the number 5 and then repeatedly do the following. We either add
3 to the number (with probability 1/4), or subtract 1 from the number (with probability
3/4). Let $X_n$ be the number obtained after repeating this procedure $n$ times. Then,
given the value of $X_n$, we see that $X_{n+1} = X_n + 3$ with probability 1/4, while
$X_{n+1} = X_n - 1$ with probability 3/4. Hence,

$$E(X_{n+1} - X_n \mid X_n) = (3)(1/4) + (-1)(3/4) = 3/4 - 3/4 = 0$$

and $\{X_n\}$ is a martingale. ■


It is sometimes possible to create martingales in subtle ways, as follows. 


EXAMPLE 11.4.4 
Let $\{X_n\}$ again be simple random walk, but this time for general $p$. Then $X_{n+1} - X_n$
is equal to 1 with probability $p$, and to $-1$ with probability $q = 1 - p$. Hence,

$$E(X_{n+1} - X_n \mid X_n) = (1)(p) + (-1)(q) = p - q = 2p - 1.$$

If $p \ne 1/2$, then this is not equal to 0. Hence, $\{X_n\}$ does not stay the same on average,
so $\{X_n\}$ is not a martingale.
On the other hand, let

$$Z_n = \left(\frac{1 - p}{p}\right)^{X_n},$$

i.e., $Z_n$ equals the constant $(1 - p)/p$ raised to the power of $X_n$. Then increasing $X_n$ by
1 corresponds to multiplying $Z_n$ by $(1 - p)/p$, while decreasing $X_n$ by 1 corresponds
to dividing $Z_n$ by $(1 - p)/p$, i.e., multiplying by $p/(1 - p)$. But $X_{n+1} = X_n + 1$ with
probability $p$, while $X_{n+1} = X_n - 1$ with probability $q = 1 - p$. Therefore, we see
that, given the value of $Z_n$, we have

$$E(Z_{n+1} - Z_n \mid Z_n) = \left(\frac{1 - p}{p} Z_n - Z_n\right) p + \left(\frac{p}{1 - p} Z_n - Z_n\right)(1 - p)
= \big((1 - p) Z_n - p Z_n\big) + \big(p Z_n - (1 - p) Z_n\big) = 0.$$

Accordingly, $E(Z_{n+1} - Z_n \mid Z_n) = 0$, so that $\{Z_n\}$ stays the same on average, i.e., $\{Z_n\}$
is a martingale. ■
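
The algebra in this example can be double-checked with exact rational arithmetic. The short Python sketch below (function names are hypothetical) computes the conditional expected increments of $X_n$ and of $Z_n = ((1 - p)/p)^{X_n}$ for a given current value $x$ of the walk, using `fractions.Fraction` so that no rounding is involved.

```python
from fractions import Fraction

def expected_increment_X(p):
    """E(X_{n+1} - X_n | X_n): +1 with probability p, -1 with probability 1 - p."""
    return (1) * p + (-1) * (1 - p)

def expected_increment_Z(x, p):
    """E(Z_{n+1} - Z_n | X_n = x), where Z_n = ((1 - p)/p)^{X_n}."""
    rho = (1 - p) / p
    z = rho ** x
    z_up = rho ** (x + 1)      # value of Z after a step up
    z_down = rho ** (x - 1)    # value of Z after a step down
    return p * (z_up - z) + (1 - p) * (z_down - z)

p = Fraction(2, 3)
print(expected_increment_X(p))                                 # equals 2p - 1
print([expected_increment_Z(x, p) for x in range(-3, 4)])      # all zero
```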


11.4.2 | Expected Values 


Because martingales stay the same on average, we immediately have the following. 




Theorem 11.4.1 Let $\{X_n\}$ be a martingale with $X_0 = a$. Then $E(X_n) = a$ for all
$n$.


This theorem sometimes provides very useful information, as the following exam- 
ples demonstrate. 


EXAMPLE 11.4.5 

Let $\{X_n\}$ again be simple random walk with $p = 1/2$. Then we have already seen that
$\{X_n\}$ is a martingale. Hence, if $X_0 = a$, then we will have $E(X_n) = a$ for all $n$. That
is, for a fair game (i.e., for $p = 1/2$), no matter how long you have been gambling,
your average fortune will always be equal to your initial fortune $a$. ■


EXAMPLE 11.4.6 

Suppose we start with the number 10 and then repeatedly do the following. We either 
add 2 to the number (with probability 1/3), or subtract 1 from the number (with proba- 
bility 2/3). Suppose we repeat this process 25 times. What is the expected value of the 
number we end up with? 

Without martingale theory, this problem appears to be difficult, requiring lengthy 
computations of various possibilities for what could happen on each of the 25 steps. 
However, with martingale theory, it is very easy. 

Indeed, let $X_n$ be the number after $n$ steps, so that $X_0 = 10$, $X_1 = 12$ (with
probability 1/3) or $X_1 = 9$ (with probability 2/3), etc. Then, because $X_{n+1} - X_n$
equals either 2 (with probability 1/3) or $-1$ (with probability 2/3), we have

$$E(X_{n+1} - X_n \mid X_n) = (2)(1/3) + (-1)(2/3) = 2/3 - 2/3 = 0.$$

Hence, $\{X_n\}$ is a martingale.
It then follows that $E(X_n) = X_0 = 10$, for any $n$. In particular, $E(X_{25}) = 10$.
That is, after 25 steps, on average the number will be equal to 10. ■
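
For comparison, the "lengthy computation" that martingale theory avoids is easy to hand to a computer. The following Python sketch (the function name is an assumption) builds the exact distribution of $X_{25}$ one step at a time with rational arithmetic and then computes its mean directly.

```python
from fractions import Fraction

def distribution_after(n_steps, start=10):
    """Exact distribution of X_n: add 2 w.p. 1/3, subtract 1 w.p. 2/3 at each step."""
    dist = {start: Fraction(1)}
    for _ in range(n_steps):
        new = {}
        for x, pr in dist.items():
            new[x + 2] = new.get(x + 2, Fraction(0)) + pr * Fraction(1, 3)
            new[x - 1] = new.get(x - 1, Fraction(0)) + pr * Fraction(2, 3)
        dist = new
    return dist

dist = distribution_after(25)
mean = sum(x * pr for x, pr in dist.items())
print(mean)   # the brute-force mean agrees with the martingale answer, 10
```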


11.4.3 | Stopping Times 


If $\{X_n\}$ is a martingale with $X_0 = a$, then it is very helpful to know that $E(X_n) = a$
for all $n$. However, it is sometimes even more helpful to know that $E(X_T) = a$, where
$T$ is a random time. Now, this is not always true; however, it is often true, as we shall
see. We begin with another definition.


Definition 11.4.2 Let $\{X_n\}$ be a stochastic process, and let $T$ be a random variable
taking values in $\{0, 1, 2, \ldots\}$. Then $T$ is a stopping time if for all $m = 0, 1, 2, \ldots$,
the event $\{T = m\}$ is independent of the values $X_{m+1}, X_{m+2}, \ldots$. That is, when
deciding whether or not $T = m$ (i.e., whether or not to "stop" at time $m$), we are
not allowed to look at the future values $X_{m+1}, X_{m+2}, \ldots$.


EXAMPLE 11.4.7 

Let $\{X_n\}$ be simple random walk, let $b$ be any integer, and let $\tau_b = \min\{n \ge 0 : X_n = b\}$
be the first time we hit the value $b$. Then $\tau_b$ is a stopping time because the event
$\tau_b = n$ depends only on $X_0, \ldots, X_n$, not on $X_{n+1}, X_{n+2}, \ldots$.




On the other hand, let $T = \tau_b - 1$, so that $T$ corresponds to stopping just before
we hit $b$. Then $T$ is not a stopping time because it must look at the future value $X_{m+1}$
to decide whether or not to stop at time $m$. ■


A key result about martingales and stopping times is the optional stopping theorem, 
as follows. 


Theorem 11.4.2 (Optional stopping theorem) Suppose $\{X_n\}$ is a martingale with
$X_0 = a$, and $T$ is a stopping time. Suppose further that either

(a) the martingale is bounded up to time $T$, i.e., for some $M > 0$ we have $|X_n| \le M$
for all $n \le T$; or

(b) the stopping time is bounded, i.e., for some $M > 0$ we have $T \le M$.

Then $E(X_T) = a$, i.e., on average the value of the process at the random time $T$ is
equal to the starting value $a$.


PROOF | For a proof and further discussion, see, e.g., page 273 of Probability: Theory
and Examples, 2nd ed., by R. Durrett (Duxbury Press, New York, 1996). ■


Consider a simple application of this. 


EXAMPLE 11.4.8 

Let $\{X_n\}$ be simple random walk with initial value $a$ and with $p = 1/2$. Let $r > a > s$
be integers. Let $T = \min\{\tau_r, \tau_s\}$ be the first time the process hits either $r$ or $s$. Then
$r \ge X_n \ge s$ for $n \le T$, so that condition (a) of the optional stopping theorem applies.
We conclude that $E(X_T) = a$, i.e., that at time $T$, the walk will on average be equal to
$a$. ■


We shall see that the optional stopping theorem is useful in many ways. 


EXAMPLE 11.4.9 
We can use the optional stopping theorem to find the probability that the simple random 
walk with p = 1/2 will hit r before hitting another value s. 

Indeed, again let $\{X_n\}$ be simple random walk with initial value $a$ and $p = 1/2$,
with $r > a > s$ integers and $T = \min\{\tau_r, \tau_s\}$. Then as earlier, $E(X_T) = a$. We can
use this to solve for $P(X_T = r)$, i.e., for the probability that the walk hits $r$ before
hitting $s$.

Clearly, we always have either $X_T = r$ or $X_T = s$. Let $h = P(X_T = r)$. Then
$E(X_T) = hr + (1 - h)s$. Because $E(X_T) = a$, we must have $a = hr + (1 - h)s$.
Solving for $h$, we see that

$$h = \frac{a - s}{r - s}.$$

We conclude that the probability that the process will hit $r$ before it hits $s$ is equal
to $(a - s)/(r - s)$. Note that absolutely no difficult computations were required to
obtain this result. ■
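
The formula $h = (a - s)/(r - s)$ can be checked empirically by simulation. The Python sketch below is only an illustration: the function names, the seed, the number of runs, and the particular values $a = 3$, $r = 10$, $s = 0$ are all assumptions. For these values the formula gives $h = 0.3$.

```python
import random

def hits_r_first(a, r, s, rng):
    """Run simple random walk (p = 1/2) from a until it hits r or s."""
    x = a
    while s < x < r:
        x += rng.choice((-1, 1))
    return x == r

def estimate_h(a, r, s, n_runs=20_000, seed=0):
    """Monte Carlo estimate of P(X_T = r)."""
    rng = random.Random(seed)
    wins = sum(hits_r_first(a, r, s, rng) for _ in range(n_runs))
    return wins / n_runs

est = estimate_h(a=3, r=10, s=0)
print(est, (3 - 0) / (10 - 0))   # simulated value next to (a - s)/(r - s)
```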


A special case of the previous example is particularly noteworthy. 




EXAMPLE 11.4.10 

In the previous example, suppose $r = c$ and $s = 0$. Then the value $h = P(X_T = r)$
is precisely the same as the probability of success in the gambler's ruin problem. The
previous example shows that $h = (a - s)/(r - s) = a/c$. This gives the same answer
as Theorem 11.1.2, but with far less effort. ■


It is impressive that, in the preceding example, martingale theory can solve the
gambler's ruin problem so easily in the case $p = 1/2$. Our previous solution, without
using martingale theory, was much more difficult (see Section 11.7). Even more surprising,
martingale theory can also solve the gambler's ruin problem when $p \ne 1/2$, as
follows.


EXAMPLE 11.4.11 
Let $\{X_n\}$ be simple random walk with initial value $a$ and with $p \ne 1/2$. Let $0 < a < c$
be integers. Let $T = \min\{\tau_c, \tau_0\}$ be the first time the process hits either $c$ or 0. To
solve the gambler's ruin problem in this case, we are interested in $g = P(X_T = c)$.
We can use the optional stopping theorem to solve for the gambler's ruin probability $g$,
as follows.

Now, $\{X_n\}$ is not a martingale, so we cannot apply martingale theory to it. However,
let

$$Z_n = \left(\frac{1 - p}{p}\right)^{X_n}.$$

Then $\{Z_n\}$ has initial value $Z_0 = ((1 - p)/p)^a$. Also, we know from Example 11.4.4
that $\{Z_n\}$ is a martingale. Furthermore,

$$0 \le Z_n \le \max\left\{\left(\frac{1 - p}{p}\right)^{c}, \left(\frac{1 - p}{p}\right)^{-c}\right\}$$

for $n \le T$, so that condition (a) of the optional stopping theorem applies. We conclude
that

$$E(Z_T) = Z_0 = \left(\frac{1 - p}{p}\right)^{a}.$$

Now, clearly, we always have either $X_T = c$ (with probability $g$) or $X_T = 0$
(with probability $1 - g$). In the former case, $Z_T = ((1 - p)/p)^c$, while in the latter
case, $Z_T = 1$. Hence, $E(Z_T) = g\,((1 - p)/p)^c + (1 - g)(1)$. Because $E(Z_T) =
((1 - p)/p)^a$, we must have

$$\left(\frac{1 - p}{p}\right)^{a} = g \left(\frac{1 - p}{p}\right)^{c} + (1 - g)(1).$$

Solving for $g$, we see that

$$g = \frac{((1 - p)/p)^a - 1}{((1 - p)/p)^c - 1}.$$

This again gives the same answer as Theorem 11.1.2, this time for $p \ne 1/2$, but
again with far less effort. ■
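
As a check on the formula for $g$, one can compare it against a direct first-step calculation. The Python sketch below is an illustration with hypothetical function names. It solves the recursion $g(x) = p\, g(x+1) + (1 - p)\, g(x-1)$ with $g(0) = 0$ and $g(c) = 1$ by starting from an unnormalized solution and rescaling, and uses exact fractions so that the two answers can be compared for equality.

```python
from fractions import Fraction

def ruin_formula(a, c, p):
    """Martingale answer: g = (rho^a - 1) / (rho^c - 1), with rho = (1 - p)/p."""
    rho = (1 - p) / p
    return (rho ** a - 1) / (rho ** c - 1)

def ruin_first_step(a, c, p):
    """Solve g(x) = p g(x+1) + (1-p) g(x-1), g(0) = 0, g(c) = 1, directly."""
    g = [Fraction(0), Fraction(1)]           # unnormalized: g(0) = 0, g(1) = 1
    for x in range(1, c):
        # Rearranged recursion: g(x+1) = (g(x) - (1-p) g(x-1)) / p
        g.append((g[x] - (1 - p) * g[x - 1]) / p)
    return g[a] / g[c]                       # rescale so that g(c) = 1

p = Fraction(2, 3)
for a in range(1, 10):
    print(a, ruin_formula(a, 10, p), ruin_first_step(a, 10, p))
```

Both functions expect `p` as a `Fraction` different from 1/2; with floating-point input the comparison would only hold approximately.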


Martingale theory can also tell us other surprising facts. 




EXAMPLE 11.4.12 

Let {Xn} be simple random walk with p = 1/2 and with initial value a = 0. Will 
the walk hit the value —1 some time during the first million steps? Probably yes, but 
not for sure. Furthermore, conditional on not hitting —1, it will probably be extremely 
large, as we now discuss. 

Let $T = \min\{10^6, \tau_{-1}\}$. That is, $T$ is the first time the process hits $-1$, unless that
takes more than one million steps, in which case $T = 10^6$.

Now, $\{X_n\}$ is a martingale. Also $T$ is a stopping time (because it does not look
into the future when deciding whether or not to stop). Furthermore, we always have
$T \le 10^6$, so condition (b) of the optional stopping theorem applies. We conclude that
$E(X_T) = a = 0$.

On the other hand, by the law of total expectation, we have

$$E(X_T) = E(X_T \mid X_T = -1)\, P(X_T = -1) + E(X_T \mid X_T \ne -1)\, P(X_T \ne -1).$$

Also, clearly $E(X_T \mid X_T = -1) = -1$. Let $u = P(X_T = -1)$, so that
$P(X_T \ne -1) = 1 - u$. Then we conclude that $0 = (-1)u + E(X_T \mid X_T \ne -1)(1 - u)$, so that

$$E(X_T \mid X_T \ne -1) = \frac{u}{1 - u}.$$

Now, clearly, $u$ will be very close to 1, i.e., it is very likely that within $10^6$ steps the
process will have hit $-1$. Hence, $E(X_T \mid X_T \ne -1)$ is extremely large.

We may summarize this discussion as follows. Nearly always we have $X_T = -1$.
However, very occasionally we will have $X_T \ne -1$. Furthermore, the average value
of $X_T$ when $X_T \ne -1$ is so large that overall (i.e., counting both the case $X_T = -1$
and the case $X_T \ne -1$), the average value of $X_T$ is 0 (as it must be because $\{X_n\}$ is a
martingale)! ■
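
The relation $E(X_T \mid X_T \ne -1) = u/(1 - u)$ can be verified exactly on a smaller scale. The Python sketch below replaces the horizon of $10^6$ steps by a hypothetical horizon of 100 steps (purely to keep the computation tiny) and tracks the exact distribution of the walk stopped at $-1$; the function name is an assumption.

```python
from fractions import Fraction

def stopped_distribution(horizon):
    """Exact law of X_T for T = min(horizon, first hitting time of -1), X_0 = 0.

    Returns (u, alive): u = P(X_T = -1), and alive maps each state x >= 0 to
    the probability that the walk is at x at time `horizon` without hitting -1.
    """
    half = Fraction(1, 2)
    alive = {0: Fraction(1)}
    absorbed = Fraction(0)
    for _ in range(horizon):
        new = {}
        for x, pr in alive.items():
            for y in (x - 1, x + 1):
                if y == -1:
                    absorbed += pr * half          # the walk stops at -1
                else:
                    new[y] = new.get(y, Fraction(0)) + pr * half
        alive = new
    return absorbed, alive

u, alive = stopped_distribution(100)
cond_mean = sum(x * pr for x, pr in alive.items()) / (1 - u)
print(float(u), float(cond_mean), float(u / (1 - u)))
```

Even with only 100 steps, $u$ is already above 0.9, so the conditional mean is several times larger than a typical displacement of the unstopped walk.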


If one is not careful, then it is possible to be tricked by martingale theory, as follows. 


EXAMPLE 11.4.13 

Suppose again that $\{X_n\}$ is simple random walk with $p = 1/2$ and with initial value
$a = 0$. Let $T = \tau_{-1}$, i.e., $T$ is the first time the process hits $-1$ (no matter how long
that takes).

Because the process will always wait until it hits $-1$, we always have $X_T = -1$.
Because this is true with probability 1, we also have $E(X_T) = -1$.

On the other hand, again $\{X_n\}$ is a martingale, so again it appears that we should
have $E(X_T) = 0$. What is going on?

The answer, of course, is that neither condition (a) nor condition (b) of the optional
stopping theorem is satisfied in this case. That is, there is no limit to how large $T$ might
have to be or how large $X_n$ might get for some $n \le T$. Hence, the optional stopping
theorem does not apply in this case, and we cannot conclude that $E(X_T) = 0$. Instead,
$E(X_T) = -1$ here. ■


Summary of Section 11.4 


• A Markov chain $\{X_n\}$ is a martingale if it stays the same on average, i.e., if
$E(X_{n+1} - X_n \mid X_n) = 0$ for all $n$. There are many examples.




• A stopping time $T$ for the chain is a nonnegative integer-valued random variable
that does not look into the future of $\{X_n\}$. For example, perhaps $T = \tau_b$ is the
first time the chain hits some state $b$.

• If $\{X_n\}$ is a martingale with stopping time $T$, and if either $T$ or $\{X_n\}_{n \le T}$ is
bounded, then $E(X_T) = X_0$. This can be used to solve many problems, e.g.,
gambler's ruin.


EXERCISES 


11.4.1 Suppose we define a process $\{X_n\}$ as follows. Given $X_n$, with probability 3/8
we let $X_{n+1} = X_n - 4$, while with probability 5/8 we let $X_{n+1} = X_n + C$. What value
of $C$ will make $\{X_n\}$ be a martingale?

11.4.2 Suppose we define a process $\{X_n\}$ as follows. Given $X_n$, with probability $p$ we
let $X_{n+1} = X_n + 7$, while with probability $1 - p$ we let $X_{n+1} = X_n - 2$. What value
of $p$ will make $\{X_n\}$ be a martingale?

11.4.3 Suppose we define a process $\{X_n\}$ as follows. Given $X_n$, with probability $p$ we
let $X_{n+1} = 2X_n$, while with probability $1 - p$ we let $X_{n+1} = X_n/2$. What value of $p$
will make $\{X_n\}$ be a martingale?

11.4.4 Let $\{X_n\}$ be a martingale, with initial value $X_0 = 14$. Suppose for some $n$, we
know that $P(X_n = 8) + P(X_n = 12) + P(X_n = 17) = 1$, i.e., $X_n$ is always either 8,
12, or 17. Suppose further that $P(X_n = 8) = 0.1$. Compute $P(X_n = 14)$.

11.4.5 Let {Xn} be a martingale, with initial value Xọ = 5. Suppose we know that 
P(Xg = 3) + P(Xg = 4) + P(Xg = 6) = l, i.e., Xg is always either 3, 4, or 6. 
Suppose further that P (Xg = 3) = 2 P (Xg = 6). Compute P(X; = 4). 

11.4.6 Suppose you start with 175 pennies. You repeatedly flip a fair coin. Each time 
the coin comes up heads, you win a penny; each time the coin comes up tails, you lose 
a penny. 

(a) After repeating this procedure 20 times, how many pennies will you have on aver- 
age? 

(b) Suppose you continue until you have either 100 or 200 pennies, and then you stop. 
What is the probability you will have 200 pennies when you stop? 

11.4.7 Define a process $\{X_n\}$ by $X_0 = 27$, and $X_{n+1} = 3X_n$ with probability 1/4, or
$X_{n+1} = X_n/3$ with probability 3/4. Let $T = \min\{\tau_1, \tau_{81}\}$ be the first time the process
hits either 1 or 81.

(a) Show that $\{X_n\}$ is a martingale.

(b) Show that $T$ is a stopping time.

(c) Compute $E(X_T)$.

(d) Compute the probability $P(X_T = 1)$ that the process hits 1 before hitting 81.


PROBLEMS 


11.4.8 Let $\{X_n\}$ be a stochastic process, and let $T_1$ be a stopping time. Let $T_2 = T_1 + i$
and $T_3 = T_1 - i$, for some positive integer $i$. Which of $T_2$ and $T_3$ is necessarily a
stopping time, and which is not? (Explain your reasoning.)




11.4.9 Let $\{X_n\}$ be a stochastic process, and let $T_1$ and $T_2$ be two different stopping
times. Let $T_3 = \min\{T_1, T_2\}$, and $T_4 = \max\{T_1, T_2\}$.

(a) Is $T_3$ necessarily a stopping time? (Explain your reasoning.)

(b) Is $T_4$ necessarily a stopping time? (Explain your reasoning.)


11.5 | Brownian Motion 


The simple random walk model of Section 11.1.2 (with $p = 1/2$) can be extended to
an interesting continuous-time model, called Brownian motion, as follows. Roughly,
the idea is to speed up time faster and faster by a factor of $M$ (for very large $M$),
while simultaneously shrinking space smaller and smaller by a factor of $1/\sqrt{M}$. The
factors of $M$ and $1/\sqrt{M}$ are chosen just right so that, using the central limit theorem,
we can derive properties of Brownian motion. Indeed, using the central limit theorem,
we shall see that various distributions related to Brownian motion are in fact normal
distributions.

Historically, Brownian motion gets its name from Robert Brown, a botanist, who 
in 1828 observed the motions of tiny particles in solution, under a microscope, as 
they were bombarded from random directions by many unseen molecules. Brownian 
motion was proposed as a model for the observed chaotic, random movement of such 
particles. In fact, Brownian motion turns out not to be a very good model for such 
movement (for example, Brownian motion has infinite derivative, which would only 
make sense if the particles moved infinitely quickly!). However, Brownian motion has 
many useful mathematical properties and is also very important in the theory of finance 
because it is often used as a model of stock price fluctuations. A proper mathematical 
theory of Brownian motion was developed in 1923 by Norbert Wiener (see footnote 2); as a result,
Brownian motion is also sometimes called the Wiener process. 

We shall construct Brownian motion in two steps. First, we construct faster and
faster random walks, to be called $\{Y_t^{(M)}\}$, where $M$ is large. Then, we take the limit as
$M \to \infty$ to get Brownian motion.


11.5.1 | Faster and Faster Random Walks 


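To begin, we let $Z_1, Z_2, \ldots$ be i.i.d. with $P(Z_i = +1) = P(Z_i = -1) = 1/2$. For
each $M \in \{1, 2, \ldots\}$, define a discrete-time random process

$$\{Y^{(M)}_{i/M} : i = 0, 1, \ldots\},$$

by $Y^{(M)}_0 = 0$, and

$$Y^{(M)}_{(i+1)/M} = Y^{(M)}_{i/M} + \frac{1}{\sqrt{M}}\, Z_{i+1},$$

for $i = 0, 1, 2, \ldots$, so that

$$Y^{(M)}_{i/M} = \frac{1}{\sqrt{M}}\,(Z_1 + Z_2 + \cdots + Z_i).$$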


2 Wiener was such an absent-minded professor that he once got lost and could not find his house. In his 
confusion, he asked a young girl for directions, without recognizing the girl as his daughter! 




Intuitively, then, $\{Y^{(M)}_{i/M}\}$ is like an ordinary (discrete-time) random walk (with $p =
1/2$), except that time has been sped up by a factor of $M$ and space has been shrunk
by a factor of $\sqrt{M}$ (each step in the new walk moves a distance $1/\sqrt{M}$). That is, this
process takes lots and lots of very small steps.

To make $\{Y^{(M)}_{i/M}\}$ into a continuous-time process, we can then "fill in" the missing
values by making the function linear on the intervals $[i/M, (i+1)/M]$. In this way,
we obtain a continuous-time process

$$\{Y^{(M)}_t : t \ge 0\},$$

which agrees with $Y^{(M)}_{i/M}$ whenever $t = i/M$. In Figure 11.5.1, we have plotted

$$\{Y^{(10)}_{i/10} : i = 0, 1, \ldots, 20\}$$

(the dots) and the corresponding values of

$$\{Y^{(10)}_t : 0 \le t \le 2\}$$

(the solid line), arising from the realization

$$(Z_1, \ldots, Z_{20}) = (1, -1, -1, -1, -1, 1, \ldots),$$

where we have taken $1/\sqrt{10} = 0.316$.


[Figure 11.5.1: Plot of some values of $Y^{(10)}_{i/10}$ (dots) and $Y^{(10)}_t$ (solid line) against $t$, for $t$ from 0.0 to 2.0.]


The collection of variables $\{Y^{(M)}_t : t \ge 0\}$ is then a stochastic process but is now
indexed by the continuous time parameter $t \ge 0$. This is an example of a continuous-time
stochastic process.
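
The construction above can be sketched in a few lines of Python. This is only an illustration, not part of the original text: the function names `scaled_walk` and `interpolate` are made up here, and the steps used are just the first six values of the realization quoted for Figure 11.5.1.

```python
import math

def scaled_walk(z, M):
    """Values Y^{(M)}_{i/M} for i = 0..len(z), given steps z[i] in {+1, -1}."""
    step = 1.0 / math.sqrt(M)          # each step moves a distance 1/sqrt(M)
    values = [0.0]                     # Y^{(M)}_0 = 0
    for zi in z:
        values.append(values[-1] + step * zi)
    return values

def interpolate(values, M, t):
    """Y^{(M)}_t by linear interpolation between the grid points i/M."""
    pos = t * M
    i = min(int(math.floor(pos)), len(values) - 2)
    frac = pos - i
    return values[i] + frac * (values[i + 1] - values[i])

z = [1, -1, -1, -1, -1, 1]             # first six steps of the realization in the text
grid = scaled_walk(z, M=10)
print([round(v, 3) for v in grid])     # values at t = 0, 0.1, ..., 0.6
print(round(interpolate(grid, 10, 0.05), 3))
```

The grid values are all multiples of 0.316, which is where the vertical tick spacing in Figure 11.5.1 comes from.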

Now, the factors $M$ and $\sqrt{M}$ have been chosen carefully, as the following theorem
illustrates.




Theorem 11.5.1 Let $\{Y^{(M)}_t : t \ge 0\}$ be as defined earlier. Then for large $M$:

(a) For $t \ge 0$, the distribution of $Y^{(M)}_t$ is approximately $N(0, t)$, i.e., normally
distributed with mean 0 and variance $t$.

(b) For $s, t \ge 0$, the covariance

$$\mathrm{Cov}\big(Y^{(M)}_s, Y^{(M)}_t\big)$$

is approximately equal to $\min\{s, t\}$.

(c) For $t \ge s \ge 0$, the distribution of the increment $Y^{(M)}_t - Y^{(M)}_s$ is approximately
$N(0, t - s)$, i.e., normally distributed with mean 0 and variance $t - s$, and
is approximately independent of $Y^{(M)}_s$.

(d) $Y^{(M)}_t$ is a continuous function of $t$.


PROOF | See Section 11.7 for the proof of this result. ■
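
Parts (a) and (b) of the theorem can be checked empirically. The following is a rough Monte Carlo sketch, not a proof; the function names, the choice $M = 50$, the sample size, and the seed are all arbitrary assumptions made for illustration, and it assumes $0 < s \le t$ with $sM \ge 1$.

```python
import math
import random

def sample_pair(M, s, t, rng):
    """One draw of (Y_s, Y_t) at the grid points floor(sM)/M <= floor(tM)/M."""
    ns, nt = int(s * M), int(t * M)
    total, y_s = 0, 0
    for i in range(1, nt + 1):
        total += rng.choice((-1, 1))   # Z_i = +1 or -1 with probability 1/2 each
        if i == ns:
            y_s = total
    scale = 1.0 / math.sqrt(M)
    return y_s * scale, total * scale

def estimate(M=50, s=0.5, t=1.0, n=5000, seed=0):
    """Sample mean and variance of Y_t, and sample value of E(Y_s Y_t)."""
    rng = random.Random(seed)
    pairs = [sample_pair(M, s, t, rng) for _ in range(n)]
    mean_t = sum(b for _, b in pairs) / n
    var_t = sum(b * b for _, b in pairs) / n - mean_t ** 2
    cov_st = sum(a * b for a, b in pairs) / n   # both means are 0 in theory
    return mean_t, var_t, cov_st

print(estimate())
```

With $s = 0.5$ and $t = 1$, the three estimates should land near 0, $t = 1$, and $\min\{s, t\} = 0.5$, up to sampling error.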


We shall use this limit theorem to construct Brownian motion. 


11.5.2 | Brownian Motion as a Limit 


We have now developed the faster and faster processes $\{Y^{(M)}_t : t \ge 0\}$, and some
of their properties. Brownian motion is then defined as the limit as $M \to \infty$ of the
processes $\{Y^{(M)}_t : t \ge 0\}$. That is, we define Brownian motion $\{B_t : t \ge 0\}$ by saying
that the distribution of $\{B_t : t \ge 0\}$ is equal to the limit as $M \to \infty$ of the distribution
of $\{Y^{(M)}_t : t \ge 0\}$. A graph of a typical run of Brownian motion is in Figure 11.5.2.


Figure 11.5.2: A typical outcome from Brownian motion. 


In this way, all the properties of $Y^{(M)}_t$ for large $M$, as developed in Theorem 11.5.1,
will apply to Brownian motion, as follows.




Theorem 11.5.2 Let $\{B_t : t \ge 0\}$ be Brownian motion. Then
(a) $B_t$ is normally distributed: $B_t \sim N(0, t)$ for any $t \ge 0$;
(b) $\mathrm{Cov}(B_s, B_t) = E(B_s B_t) = \min\{s, t\}$ for $s, t \ge 0$;
(c) if $0 < s < t$, then the increment $B_t - B_s$ is normally distributed: $B_t - B_s \sim
N(0, t - s)$, and furthermore $B_t - B_s$ is independent of $B_s$;
(d) the function $\{B_t\}_{t \ge 0}$ is a continuous function.


This theorem can be used to compute many things about Brownian motion. 


EXAMPLE 11.5.1
Let $\{B_t\}$ be Brownian motion. What is $P(B_5 \le 3)$?
We know that $B_5 \sim N(0, 5)$. Hence, $B_5/\sqrt{5} \sim N(0, 1)$. We conclude that

$$P(B_5 \le 3) = P(B_5/\sqrt{5} \le 3/\sqrt{5}) = \Phi(3/\sqrt{5}) = 0.910,$$

where

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-s^2/2}\, ds$$

is the cdf of a standard normal distribution, and we have found the numerical value
from Table D.2. Thus, about 91% of the time, Brownian motion will be less than 3 at
time 5. ■


EXAMPLE 11.5.2
Let $\{B_t\}$ be Brownian motion. What is $P(B_7 \ge -4)$?
We know that $B_7 \sim N(0, 7)$. Hence, $B_7/\sqrt{7} \sim N(0, 1)$. We conclude that

$$P(B_7 \ge -4) = 1 - P(B_7 < -4) = 1 - P(B_7/\sqrt{7} < -4/\sqrt{7})
= 1 - \Phi(-4/\sqrt{7}) = 1 - 0.065 = 0.935.$$

Thus, over 93% of the time, Brownian motion will be at least $-4$ at time 7. ■


EXAMPLE 11.5.3
Let $\{B_t\}$ be Brownian motion. What is $P(B_8 - B_6 \le -1.5)$?

We know that $B_8 - B_6 \sim N(0, 8 - 6) = N(0, 2)$. Hence, $(B_8 - B_6)/\sqrt{2} \sim N(0, 1)$.
We conclude that

$$P(B_8 - B_6 \le -1.5) = P((B_8 - B_6)/\sqrt{2} \le -1.5/\sqrt{2}) = \Phi(-1.5/\sqrt{2}) = 0.144.$$

Thus, about 14% of the time, Brownian motion will decrease by at least 1.5 between
time 6 and time 8. ■
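
The three table lookups in Examples 11.5.1 to 11.5.3 can be reproduced numerically. The sketch below is an illustration only: it assumes the standard identity $\Phi(x) = \tfrac{1}{2}(1 + \mathrm{erf}(x/\sqrt{2}))$, and the helper names `phi` and `bm_cdf` are hypothetical.

```python
import math

def phi(x):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bm_cdf(t, x):
    """P(B_t <= x), using B_t ~ N(0, t) for t > 0."""
    return phi(x / math.sqrt(t))

p1 = bm_cdf(5, 3)                 # Example 11.5.1
p2 = 1.0 - bm_cdf(7, -4)          # Example 11.5.2
p3 = bm_cdf(8 - 6, -1.5)          # Example 11.5.3: B_8 - B_6 ~ N(0, 2)
print(round(p1, 3), round(p2, 3), round(p3, 3))
```

The same `bm_cdf` helper applies to any increment $B_t - B_s$ by passing the elapsed time $t - s$ as its first argument.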


EXAMPLE 11.5.4
Let $\{B_t\}$ be Brownian motion. What is $P(B_2 \le -0.5,\ B_5 - B_2 \ge 1.5)$?
By Theorem 11.5.2, we see that $B_5 - B_2$ and $B_2$ are independent. Hence,

$$P(B_2 \le -0.5,\ B_5 - B_2 \ge 1.5) = P(B_2 \le -0.5)\, P(B_5 - B_2 \ge 1.5).$$

Now, we know that $B_2 \sim N(0, 2)$. Hence, $B_2/\sqrt{2} \sim N(0, 1)$, and

$$P(B_2 \le -0.5) = P(B_2/\sqrt{2} \le -0.5/\sqrt{2}) = \Phi(-0.5/\sqrt{2}).$$




Similarly, $B_5 - B_2 \sim N(0, 3)$, so $(B_5 - B_2)/\sqrt{3} \sim N(0, 1)$, and

$$P(B_5 - B_2 \ge 1.5) = P((B_5 - B_2)/\sqrt{3} \ge 1.5/\sqrt{3})
= 1 - P((B_5 - B_2)/\sqrt{3} < 1.5/\sqrt{3}) = 1 - \Phi(1.5/\sqrt{3}) = \Phi(-1.5/\sqrt{3}).$$

We conclude that

$$P(B_2 \le -0.5,\ B_5 - B_2 \ge 1.5) = P(B_2 \le -0.5)\, P(B_5 - B_2 \ge 1.5)
= \Phi(-0.5/\sqrt{2})\, \Phi(-1.5/\sqrt{3}) = (0.362)(0.193) = 0.070.$$

Thus, about 7% of the time, Brownian motion will be no more than $-1/2$ at time 2
and will then increase by at least 1.5 between time 2 and time 5. ■


We note also that, because Brownian motion was created from simple random
walks with $p = 1/2$, it follows that Brownian motion is a martingale. This implies
that $E(B_t) = 0$ for all $t$, but of course, we already knew that because $B_t \sim N(0, t)$.
On the other hand, we can now use the optional stopping theorem (Theorem 11.4.2) to
conclude that $E(B_T) = 0$, where $T$ is a stopping time (provided, as usual, that either
$T$ or $\{B_t : t \le T\}$ is bounded). This allows us to compute certain probabilities, as
follows.


EXAMPLE 11.5.5
Let $\{B_t\}$ be Brownian motion. Let $c < 0 < b$. What is the probability the process will
hit $c$ before it hits $b$?

To solve this problem, we let $\tau_c$ be the first time the process hits $c$, and $\tau_b$ be the
first time the process hits $b$. We then let $T = \min\{\tau_c, \tau_b\}$ be the first time the process
either hits $c$ or hits $b$. The question becomes, what is $P(\tau_c < \tau_b)$? Equivalently, what
is $P(B_T = c)$?

To solve this, we note that we must have $E(B_T) = B_0 = 0$. But if $h = P(B_T = c)$,
then $B_T = c$ with probability $h$, and $B_T = b$ with probability $1 - h$. Hence, we must
have $0 = E(B_T) = hc + (1 - h)b$, so that $h = b/(b - c)$. We conclude that

$$P(B_T = c) = P(\tau_c < \tau_b) = \frac{b}{b - c}.$$

(Recall that $c < 0$, so that $b - c = |b| + |c|$ here.) ■
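
The formula $b/(b - c)$ can be sanity-checked by simulation. The sketch below is an assumption-laden illustration: instead of Brownian motion itself it simulates the symmetric simple random walk from which Brownian motion was built, restricted to integer $c$ and $b$, for which Theorem 11.1.2 (with $p = 1/2$) gives the same answer. The function names, run count, and seed are arbitrary.

```python
import random

def hit_c_first(c, b):
    """P(process started at 0 hits c before b), for c < 0 < b."""
    return b / (b - c)

def simulate(c, b, n=4000, seed=1):
    """Fraction of symmetric simple random walks from 0 that reach c before b.

    Assumes c and b are integers with c < 0 < b.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = 0
        while c < x < b:
            x += rng.choice((-1, 1))
        hits += (x == c)
    return hits / n

print(hit_c_first(-5, 15))                    # closed form for c = -5, b = 15
print(simulate(-2, 3), hit_c_first(-2, 3))    # simulated fraction next to the closed form
```

The nearer barrier is the more likely one to be hit first, which is what the formula says: the probability of hitting $c$ first grows as $|c|$ shrinks relative to $b$.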

Finally, we note that although Brownian motion is a continuous function, it turns 
out that, with probability one, Brownian motion is not differentiable anywhere at all! 
This is part of the reason that Brownian motion is not a good model for the movement of 
real particles. (See Challenge 11.5.15 for a result related to this.) However, Brownian 
motion has many other uses, including as a model for stock prices, which we now 
describe. 


11.5.3 | Diffusions and Stock Prices 


Brownian motion is used to construct various diffusion processes, as follows.
Given Brownian motion $\{B_t\}$, we can let

$$X_t = a + \delta t + \sigma B_t,$$




where $a$ and $\delta$ are any real numbers, and $\sigma \ge 0$. Then $\{X_t\}$ is a diffusion.

Here, $a$ is the initial value, $\delta$ (called the drift) is the average rate of increase, and $\sigma$
(called the volatility parameter) represents the amount of randomness of the diffusion.

Intuitively, $X_t$ is approximately equal to the linear function $a + \delta t$, but due to
the randomness of Brownian motion, $X_t$ takes on random values around this linear
function.

The precise distribution of $X_t$ can be computed, as follows.


Theorem 11.5.3 Let $\{B_t\}$ be Brownian motion, and let $X_t = a + \delta t + \sigma B_t$ be a
diffusion. Then
(a) $E(X_t) = a + \delta t$,
(b) $\mathrm{Var}(X_t) = \sigma^2 t$,
(c) $X_t \sim N(a + \delta t, \sigma^2 t)$.


PROOF | We know $B_t \sim N(0, t)$, so $E(B_t) = 0$ and $\mathrm{Var}(B_t) = t$. Also, $a + \delta t$ is
not random (i.e., is a constant from the point of view of random variables). Hence,

$$E(X_t) = E(a + \delta t + \sigma B_t) = a + \delta t + \sigma E(B_t) = a + \delta t,$$

proving part (a).
Similarly,

$$\mathrm{Var}(X_t) = \mathrm{Var}(a + \delta t + \sigma B_t) = \mathrm{Var}(\sigma B_t) = \sigma^2\, \mathrm{Var}(B_t) = \sigma^2 t,$$

proving part (b).
Finally, because $X_t$ is a linear function of the normally distributed random variable
$B_t$, $X_t$ must be normally distributed by Theorem 4.6.1. This proves part (c). ■


Diffusions are often used as models for stock prices. That is, it is often assumed
that the price $X_t$ of a stock at time $t$ is given by $X_t = a + \delta t + \sigma B_t$ for appropriate
values of $a$, $\delta$, and $\sigma$.


EXAMPLE 11.5.6
Suppose a stock has initial price $20, drift of $3 per year, and volatility parameter 1.4.
What is the probability that the stock price will be over $30 after two and a half years?
Here, the stock price after $t$ years is given by $X_t = 20 + 3t + 1.4 B_t$ and is thus a
diffusion.
So, after 2.5 years, we have $X_{2.5} = 20 + 7.5 + 1.4 B_{2.5} = 27.5 + 1.4 B_{2.5}$. Hence,

$$P(X_{2.5} > 30) = P(27.5 + 1.4 B_{2.5} > 30) = P(B_{2.5} > (30 - 27.5)/1.4) = P(B_{2.5} > 1.79).$$

But like before,

$$P(B_{2.5} > 1.79) = 1 - P(B_{2.5} \le 1.79) = 1 - P(B_{2.5}/\sqrt{2.5} \le 1.79/\sqrt{2.5})
= 1 - \Phi(1.79/\sqrt{2.5}) = 0.129.$$

We conclude that $P(X_{2.5} > 30) = 0.129$.




Hence, there is just under a 13% chance that the stock will be worth more than $30 
after two and a half years. E 


EXAMPLE 11.5.7
Suppose a stock has initial price $100, drift of $-\$2$ per year, and volatility parameter
5.5. What is the probability that the stock price will be under $90 after just half a year?
Here, the stock price after $t$ years is given by $X_t = 100 - 2t + 5.5 B_t$ and is again
a diffusion. So, after 0.5 years, we have $X_{0.5} = 100 - 1.0 + 5.5 B_{0.5} = 99 + 5.5 B_{0.5}$.
Hence,

$$P(X_{0.5} < 90) = P(99 + 5.5 B_{0.5} < 90) = P(B_{0.5} < (90 - 99)/5.5)
= P(B_{0.5} < -1.64) = P(B_{0.5}/\sqrt{0.5} < -1.64/\sqrt{0.5})
= \Phi(-1.64/\sqrt{0.5}) = \Phi(-2.32) = 0.010.$$

Therefore, there is about a 1% chance that the stock will be worth less than $90 after
half a year. ■
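
Both stock-price calculations follow the same pattern, which Theorem 11.5.3(c) makes explicit: standardize $X_t$ using its mean $a + \delta t$ and standard deviation $\sigma\sqrt{t}$. A minimal sketch, with hypothetical helper names and with $\Phi$ computed from the error function rather than read from Table D.2:

```python
import math

def phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def diffusion_cdf(a, delta, sigma, t, x):
    """P(X_t <= x) for X_t = a + delta*t + sigma*B_t, with sigma > 0 and t > 0."""
    mean = a + delta * t
    sd = sigma * math.sqrt(t)
    return phi((x - mean) / sd)

p_over_30 = 1.0 - diffusion_cdf(20, 3, 1.4, 2.5, 30)     # Example 11.5.6
p_under_90 = diffusion_cdf(100, -2, 5.5, 0.5, 90)        # Example 11.5.7
print(round(p_over_30, 3), round(p_under_90, 3))
```

Because the code does not round the intermediate value 1.79 or 1.64, its results may differ from the worked examples in the fourth decimal place.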

More generally, the drift $\delta$ and volatility $\sigma$ could be functions of the value $X_t$,
leading to more complicated diffusions $\{X_t\}$, though we do not pursue this here.


Summary of Section 11.5 


• Brownian motion $\{B_t\}_{t \ge 0}$ is created from simple random walk with $p = 1/2$, by
speeding up time by a large factor $M$, and shrinking space by a factor $1/\sqrt{M}$.

• Hence, $B_0 = 0$, $B_t \sim N(0, t)$, and $\{B_t\}$ has independent normal increments with
$B_t - B_s \sim N(0, t - s)$ for $0 < s < t$, and $\mathrm{Cov}(B_s, B_t) = \min(s, t)$, and $\{B_t\}$ is
a continuous function.

• Diffusions (often used to model stock prices) are of the form $X_t = a + \delta t + \sigma B_t$.


EXERCISES 


11.5.1 Consider the speeded-up processes $\{Y^{(M)}_{i/M}\}$ used to construct Brownian motion.
Compute the following quantities.

(a) PY =1) 

Hro r=) 

(c) PO = /2) (Hint: Don’t forget that /2 = 2/,/2.) 

(d) P(Y(™ > 1) for M =1, M =2, M =3, and M = 4 

11.5.2 Let {B;} be Brownian motion. Compute P(B; > 1). 

11.5.3 Let {B;} be Brownian motion. Compute each of the following quantities. 
(a) P(B2 = 1) 

(b) P(B3 < —4) 

(c) P(Bo — Bs < 2.4) 

(d) P(B26 — B11 > 9.8) 

(e) P(B26.3 < —6) 

(f) P(B26.3 < 0) 




11.5.4 Let {B;} be Brownian motion. Compute each of the following quantities. 

(a) P(B2 > 1, Bs — By > 2) 

(b) P(Bs < —2, Bi3 — Bs > 4) 

(c) P(Bg4 > 3.2, Bige — Bg.4 > 0.9) 

11.5.5 Let $\{B_t\}$ be Brownian motion. Compute $E(B_{13} B_8)$. (Hint: Do not forget
part (b) of Theorem 11.5.2.)

11.5.6 Let $\{B_t\}$ be Brownian motion. Compute $E((B_{17} - B_{14})^2)$ in two ways.

(a) Use the fact that $B_{17} - B_{14} \sim N(0, 3)$.

(b) Square it out, and compute $E(B_{17}^2) - 2E(B_{17} B_{14}) + E(B_{14}^2)$.

11.5.7 Let {B;} be Brownian motion. 

(a) Compute the probability that the process hits —5 before it hits 15. 

(b) Compute the probability that the process hits —15 before it hits 5. 

(c) Which of the answers to Part (a) or (b) is larger? Why is this so? 

(d) Compute the probability that the process hits 15 before it hits —5. 

(e) What is the sum of the answers to parts (a) and (d)? Why is this so? 

11.5.8 Let $X_t = 5 + 3t + 2B_t$ be a diffusion (so that $a = 5$, $\delta = 3$, and $\sigma = 2$).
Compute each of the following quantities.

(a) E(X7) 

(b) Var(Xs.1) 

(c) P(X25 < 12) 

(d) P(X17 > 50) 

11.5.9 Let $X_t = 10 - 1.5t + 4B_t$. Compute $E(X_3 X_5)$.

11.5.10 Suppose a stock has initial price $400 and has volatility parameter equal to 9. 
Compute the probability that the stock price will be over $500 after 8 years, if the drift 
per year is equal to 

(a) $0. 

(b) $5. 

(c) $10. 

(d) $20. 

11.5.11 Suppose a stock has initial price $200 and drift of $3 per year. Compute 
the probability that the stock price will be over $250 after 10 years, if the volatility 
parameter is equal to 

(a) 1. 

(b) 4. 

(c) 10. 

(d) 100. 


PROBLEMS 


11.5.12 Let $\{B_t\}$ be Brownian motion, and let $X = 2B_3 - 7B_5$. Compute the mean
and variance of $X$.


11.5.13 Prove that $P(B_t < x) = P(B_t > -x)$ for any $t \ge 0$ and any $x \in R^1$.




CHALLENGES 


11.5.14 Compute $P(B_s \le x \mid B_t = y)$, where $0 < s < t$, and $x, y \in R^1$. (Hint: You
will need to use conditional densities.)

11.5.15 (a) Let $f : R^1 \to R^1$ be a Lipschitz function, i.e., a function for which there
exists $K < \infty$ such that $|f(x) - f(y)| \le K|x - y|$ for all $x, y \in R^1$. Compute


pn d e OH 
mM — 
hNO h 


for any t € R!. 
(b) Let {B;} be Brownian motion. Compute 


P (Ges 2x) 


for any t > 0. 

(c) What do parts (a) and (b) seem to imply about Brownian motion? 

(d) It is a known fact that all functions that are continuously differentiable on a closed 
interval are Lipschitz. In light of this, what does part (c) seem to imply about Brownian 
motion? 


DISCUSSION TOPICS 


11.5.16 Diffusions such as those discussed here (and more complicated, varying co- 
efficient versions) are very often used by major investors and stock traders to model 
stock prices. 

(a) Do you think that diffusions provide good models for stock prices? 

(b) Even if diffusions did not provide good models for stock prices, why might in- 
vestors still need to know about them? 


11.6 | Poisson Processes 


Finally, we turn our attention to Poisson processes. These processes are models for
events that happen at random times $T_n$. For example, $T_n$ could be the time of the
$n$th fire in a city, or the detection of the $n$th particle by a Geiger counter, or the $n$th car
passing a checkpoint on a road. Poisson processes provide a model for the probabilities
for when these events might take place.

More formally, we let $a > 0$, and let $R_1, R_2, \ldots$ be i.i.d. random variables, each
having the Exponential($a$) distribution. We let $T_0 = 0$, and for $n \ge 1$,

$$T_n = R_1 + R_2 + \cdots + R_n.$$

The value $T_n$ thus corresponds to the (random) time of the $n$th event.
We also define a collection of counting variables $N_t$, as follows. For $t \ge 0$, we let

$$N_t = \max\{n : T_n \le t\}.$$




That is, $N_t$ counts the number of events that have happened by time $t$. (In particular,
$N_0 = 0$. Furthermore, $N_t = 0$ for all $t < T_1$, i.e., before the first event occurs.)

We can think of the collection of variables $N_t$ for $t \ge 0$ as being a stochastic
process, indexed by the continuous time parameter $t \ge 0$. The process $\{N_t : t \ge 0\}$ is
thus another example, like Brownian motion, of a continuous-time stochastic process.

In fact, $\{N_t : t \ge 0\}$ is called a Poisson process (with intensity $a$). This name comes
from the following.


Theorem 11.6.1 For any $t > 0$, the distribution of $N_t$ is Poisson($at$).


PROOF | See Section 11.7 for the proof of this result. ■
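
The definition of $N_t$ translates directly into a simulation: add up exponential waiting times until the running total passes $t$. The sketch below is illustrative only; the function names, run count, and seed are arbitrary, and it relies on `random.expovariate(a)` drawing from the exponential distribution with rate $a$ (mean $1/a$), which matches Exponential($a$) as used here.

```python
import random

def count_events(a, t, rng):
    """N_t: the number of arrival times T_n = R_1 + ... + R_n that are <= t."""
    n, clock = 0, rng.expovariate(a)    # clock holds T_1 initially
    while clock <= t:
        n += 1
        clock += rng.expovariate(a)     # move on to the next arrival time
    return n

def mean_and_variance(a, t, runs=2000, seed=0):
    """Sample mean and variance of N_t over independent simulated runs."""
    rng = random.Random(seed)
    counts = [count_events(a, t, rng) for _ in range(runs)]
    m = sum(counts) / runs
    v = sum((c - m) ** 2 for c in counts) / runs
    return m, v

print(mean_and_variance(a=5, t=3))
```

A Poisson($at$) variable has mean and variance both equal to $at$, so with $a = 5$ and $t = 3$ both estimates should be close to 15.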


In fact, even more is true. 


Theorem 11.6.2 Let $0 = t_0 < t_1 < t_2 < t_3 < \cdots < t_d$. Then for $i = 1, 2, \ldots, d$,
the distribution of $N_{t_i} - N_{t_{i-1}}$ is Poisson($a(t_i - t_{i-1})$). Furthermore, the random
variables $N_{t_i} - N_{t_{i-1}}$, for $i = 1, \ldots, d$, are independent.


PROOF | See Section 11.7 for the proof of this result. ■

EXAMPLE 11.6.1
Let $\{N_t\}$ be a Poisson process with intensity $a = 5$. What is $P(N_3 = 12)$?
Here, $N_3 \sim$ Poisson($3a$) = Poisson(15). Hence, from the definition of the Poisson
distribution, we have

$$P(N_3 = 12) = e^{-15}(15)^{12}/12! = 0.083,$$

which is a little more than 8%. ■

EXAMPLE 11.6.2
Let $\{N_t\}$ be a Poisson process with intensity $a = 2$. What is $P(N_6 = 11)$?
Here $N_6 \sim$ Poisson($6a$) = Poisson(12). Hence,

$$P(N_6 = 11) = e^{-12}(12)^{11}/11! = 0.114,$$

or just over 11%. ■

EXAMPLE 11.6.3
Let $\{N_t\}$ be a Poisson process with intensity $a = 4$. What is $P(N_2 = 3, N_5 = 4)$?
(Recall that here the comma means "and" in probability statements.)

We begin by writing $P(N_2 = 3, N_5 = 4) = P(N_2 = 3, N_5 - N_2 = 1)$. This
is just rewriting the question. However, it puts it into a context where we can use
Theorem 11.6.2.

Indeed, by that theorem, $N_2$ and $N_5 - N_2$ are independent, with $N_2 \sim$ Poisson(8)
and $N_5 - N_2 \sim$ Poisson(12). Hence,

$$P(N_2 = 3, N_5 = 4) = P(N_2 = 3, N_5 - N_2 = 1) = P(N_2 = 3)\, P(N_5 - N_2 = 1)
= e^{-8}\frac{8^3}{3!}\; e^{-12}\frac{12^1}{1!} = 0.0000021.$$

We thus see that the event $\{N_2 = 3, N_5 = 4\}$ is very unlikely in this case. ■
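
The three Poisson-process examples above reduce to evaluating the Poisson probability function for an increment. A minimal sketch, with hypothetical helper names, that rewrites each event in terms of independent increments as Theorem 11.6.2 allows:

```python
import math

def poisson_pmf(mean, k):
    """P(N = k) for N ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def increment_pmf(a, s, t, k):
    """P(N_t - N_s = k) for a Poisson process with intensity a, where 0 <= s < t."""
    return poisson_pmf(a * (t - s), k)

p1 = increment_pmf(5, 0, 3, 12)                               # Example 11.6.1
p2 = increment_pmf(2, 0, 6, 11)                               # Example 11.6.2
p3 = increment_pmf(4, 0, 2, 3) * increment_pmf(4, 2, 5, 1)    # Example 11.6.3
print(p1, p2, p3)
```

The product in `p3` is valid only because the two increments cover disjoint time intervals; the counts $N_2$ and $N_5$ themselves are not independent.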




Summary of Section 11.6 


• Poisson processes are models of events that happen at random times $T_n$.

• It is assumed that the time $R_n = T_n - T_{n-1}$ between consecutive events is
Exponential($a$) for some $a > 0$. Then $N_t$ represents the total number of events
by time $t$.

• It follows that $N_t \sim$ Poisson($at$), and in fact the process $\{N_t\}_{t \ge 0}$ has independent
increments, with $N_t - N_s \sim$ Poisson($a(t - s)$) for $0 < s < t$.


EXERCISES 


11.6.1 Let $\{N(t)\}_{t \ge 0}$ be a Poisson process with intensity $a = 7$. Compute the following
probabilities.

(a) $P(N_2 = 13)$

(b) $P(N_5 = 3)$

(c) $P(N_6 = 20)$

(d) $P(N_{50} = 340)$

(e) $P(N_2 = 13, N_5 = 3)$

(f) $P(N_2 = 13, N_6 = 20)$

(g) $P(N_2 = 13, N_5 = 3, N_6 = 20)$

11.6.2 Let $\{N(t)\}_{t \ge 0}$ be a Poisson process with intensity $a = 3$. Compute $P(N_{1/2} =
6)$ and $P(N_{0.3} = 5)$.

11.6.3 Let $\{N(t)\}_{t \ge 0}$ be a Poisson process with intensity $a = 1/3$. Compute $P(N_2 =
6)$ and $P(N_3 = 5)$.

11.6.4 Let $\{N(t)\}_{t \ge 0}$ be a Poisson process with intensity $a = 3$. Compute $P(N_2 =
6, N_3 = 5)$. Explain your answer.


11.6.5 Let $\{N(t)\}_{t \ge 0}$ be a Poisson process with intensity $a > 0$. Compute (with
explanation) the conditional probability $P(N_{2.6} = 2 \mid N_{2.9} = 2)$.

11.6.6 Let {N(¢)};>0 be a Poisson process with intensity a = 1/3. Compute (with 
explanation) the following conditional probabilities. 

(a) PWs =5| No = 5) 

(b) P(No = 5| No = 7) 

(c) P(No =5| No = 7) 

(d) P(No =7| No = 7) 

(e) P(No = 12| No = 7) 


PROBLEMS 


11.6.7 Let $\{N_t : t \ge 0\}$ be a Poisson process with intensity $a > 0$. Let $0 < s < t$, and
let $j$ be a positive integer.

(a) Compute (with explanation) the conditional probability $P(N_s = j \mid N_t = j)$.

(b) Does the answer in part (a) depend on the value of the intensity a? Intuitively, why 
or why not? 

11.6.8 Let $\{N_t : t \ge 0\}$ be a Poisson process with intensity $a > 0$. Let $T_1$ be the time
of the first event, as usual. Let $0 < s < t$.




(a) Compute $P(N_s = 1 \mid N_t = 1)$. (If you wish, you may use the previous problem,
with $j = 1$.)
(b) Suppose $t$ is fixed, but $s$ is allowed to vary in the interval $(0, t)$. What does the
answer to part (a) say about the "conditional distribution" of $T_1$, conditional on knowing
that $N_t = 1$?


11.7 | Further Proofs 
Proof of Theorem 11.1.1 


We want to prove that when $\{X_n\}$ is a simple random walk, $n$ is a positive integer, and
if $k$ is an integer such that $-n \le k \le n$ and $n + k$ is even, then

$$P(X_n = a + k) = \binom{n}{\frac{n+k}{2}}\, p^{(n+k)/2}\, q^{(n-k)/2}.$$

For all other values of $k$, we have $P(X_n = a + k) = 0$. Furthermore,
$E(X_n) = a + n(2p - 1)$.


Of the first $n$ bets, let $W_n$ be the number won, and let $L_n$ be the number lost. Then
$n = W_n + L_n$. Also, $X_n = a + W_n - L_n$.

Adding these two equations together, we conclude that $n + X_n = W_n + L_n + a +
W_n - L_n = a + 2W_n$. Solving for $W_n$, we see that $W_n = (n + X_n - a)/2$. Because
$W_n$ must be an integer, it follows that $n + X_n - a$ must be even. We conclude that
$P(X_n = a + k) = 0$ unless $n + k$ is even.

On the other hand, solving for $X_n$, we see that $X_n = a + 2W_n - n$, or $X_n -
a = 2W_n - n$. Because $0 \le W_n \le n$, it follows that $-n \le X_n - a \le n$, i.e., that
$P(X_n = a + k) = 0$ if $k < -n$ or $k > n$.

Suppose now that $k + n$ is even, and $-n \le k \le n$. Then from the above, $P(X_n =
a + k) = P(W_n = (n + k)/2)$. But the distribution of $W_n$ is clearly Binomial($n, p$).
We conclude that

$$P(X_n = a + k) = \binom{n}{\frac{n+k}{2}}\, p^{(n+k)/2}\, q^{(n-k)/2},$$

provided that $k + n$ is even and $-n \le k \le n$.

Finally, because $W_n \sim$ Binomial($n, p$), therefore $E(W_n) = np$. Hence, because
$X_n = a + 2W_n - n$, therefore $E(X_n) = a + 2E(W_n) - n = a + 2np - n = a + n(2p - 1)$,
as claimed. ■
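
The formula just proved is easy to evaluate and check numerically. The sketch below is illustrative, with hypothetical function names; it confirms that the probabilities sum to 1 and that the mean computed from them agrees with $a + n(2p - 1)$ for one arbitrary choice of $a$, $n$, and $p$.

```python
import math

def walk_pmf(n, k, p):
    """P(X_n = a + k) for a simple random walk with win probability p."""
    if abs(k) > n or (n + k) % 2 != 0:
        return 0.0                      # wrong parity or out of range
    wins = (n + k) // 2                 # W_n = (n + k)/2, so losses = (n - k)/2
    return math.comb(n, wins) * p ** wins * (1 - p) ** (n - wins)

def walk_mean(a, n, p):
    """E(X_n) computed directly from the probability function."""
    return sum((a + k) * walk_pmf(n, k, p) for k in range(-n, n + 1))

a, n, p = 10, 7, 0.3                    # arbitrary illustrative values
print(sum(walk_pmf(n, k, p) for k in range(-n, n + 1)))
print(walk_mean(a, n, p), a + n * (2 * p - 1))
```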


Proof of Theorem 11.1.2 


We want to prove that when $\{X_n\}$ is a simple random walk, with some initial fortune $a$
and probability $p$ of winning each bet, and $0 < a < c$, then the probability $P(\tau_c < \tau_0)$
of hitting $c$ before 0 is given by

$$P(\tau_c < \tau_0) = \begin{cases} a/c & p = 1/2 \\[4pt] \dfrac{1 - (q/p)^a}{1 - (q/p)^c} & p \ne 1/2. \end{cases}$$


To begin, let us write $s(b)$ for the probability $P(\tau_c < \tau_0)$ when starting at the initial
fortune $b$, for any $0 \le b \le c$. We are interested in computing $s(a)$. However, it turns
out to be easier to solve for all of the values $s(0), s(1), s(2), \ldots, s(c)$ simultaneously,
and this is the trick we use.

We have by definition that $s(0) = 0$ (i.e., if we start with $0, then we can never
win) and $s(c) = 1$ (i.e., if we start with $c$ dollars, then we have already won). So, those two
cases are easy. However, the values of $s(b)$ for $1 \le b \le c - 1$ are not obtained as
easily.

Our trick will be to develop equations that relate the values $s(b)$ for different values
of $b$. Indeed, suppose $1 \le b \le c - 1$. It is difficult to compute $s(b)$ directly. However,
it is easy to understand what will happen on the first bet: we will either win $1 with
probability $p$, or lose $1 with probability $q$. That leads to the following result.


Lemma 11.7.1 For $1 \le b \le c - 1$, we have

$$s(b) = p\, s(b + 1) + q\, s(b - 1). \qquad (11.7.1)$$


PROOF | Suppose first that we win the first bet, i.e., that $Z_1 = 1$. After this first
bet, we will have fortune $b + 1$. We then get to "start over" in our quest to reach $c$
before reaching 0, except this time starting with fortune $b + 1$ instead of $b$. Hence, after
winning this first bet, our chance of reaching $c$ before reaching 0 is now $s(b + 1)$. (We
still do not know what $s(b + 1)$ is, but at least we are making a connection between
$s(b)$ and $s(b + 1)$.)

Suppose instead that we lose this first bet, i.e., that $Z_1 = -1$. After this first bet,
we will have fortune $b - 1$. We then get to "start over" with fortune $b - 1$ instead of $b$.
Hence, after this first bet, our chance of reaching $c$ before reaching 0 is now $s(b - 1)$.

We can combine all of the preceding information, as follows.

$$\begin{aligned}
s(b) &= P(\tau_c < \tau_0) \\
&= P(Z_1 = 1,\ \tau_c < \tau_0) + P(Z_1 = -1,\ \tau_c < \tau_0) \\
&= p\, s(b + 1) + q\, s(b - 1)
\end{aligned}$$

That is, $s(b) = p\, s(b + 1) + q\, s(b - 1)$, as claimed. ■


So, where are we? We had $c + 1$ unknowns, $s(0), s(1), \ldots, s(c)$. We now know
the two equations $s(0) = 0$ and $s(c) = 1$, plus the $c - 1$ equations of the form $s(b) =
p\, s(b + 1) + q\, s(b - 1)$ for $b = 1, 2, \ldots, c - 1$. In other words, we have $c + 1$ equations
in $c + 1$ unknowns, so we can now solve our problem!

The solution still requires several algebraic steps, as follows. 




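Lemma 11.7.2 For $1 \le b \le c - 1$, we have

$$s(b + 1) - s(b) = \frac{q}{p}\,\big(s(b) - s(b - 1)\big).$$


PROOF | Recalling that $p + q = 1$, we rearrange (11.7.1) as follows.

$$\begin{aligned}
s(b) &= p\, s(b + 1) + q\, s(b - 1) \\
(p + q)\, s(b) &= p\, s(b + 1) + q\, s(b - 1) \\
q\,\big(s(b) - s(b - 1)\big) &= p\,\big(s(b + 1) - s(b)\big)
\end{aligned}$$

And finally,

$$s(b + 1) - s(b) = \frac{q}{p}\,\big(s(b) - s(b - 1)\big),$$

which gives the result. ■


Lemma 11.7.3 For $0 \le b \le c$, we have

$$s(b) = \sum_{i=0}^{b-1} \left(\frac{q}{p}\right)^{i} s(1). \qquad (11.7.2)$$


PROOF | Applying the equation of Lemma 11.7.2 with $b = 1$, we obtain

$$s(2) - s(1) = \frac{q}{p}\,\big(s(1) - s(0)\big) = \frac{q}{p}\, s(1)$$

(because $s(0) = 0$). Applying it again with $b = 2$, we obtain

$$s(3) - s(2) = \frac{q}{p}\,\big(s(2) - s(1)\big) = \left(\frac{q}{p}\right)^{2} \big(s(1) - s(0)\big) = \left(\frac{q}{p}\right)^{2} s(1).$$

By induction, we see that

$$s(b + 1) - s(b) = \left(\frac{q}{p}\right)^{b} s(1),$$

for $b = 0, 1, 2, \ldots, c - 1$. Hence, we compute that for $b = 0, 1, 2, \ldots, c$,

$$\begin{aligned}
s(b) &= \big(s(b) - s(b - 1)\big) + \big(s(b - 1) - s(b - 2)\big) + \big(s(b - 2) - s(b - 3)\big) + \cdots + \big(s(1) - s(0)\big) \\
&= \sum_{i=0}^{b-1} \big(s(i + 1) - s(i)\big) = \sum_{i=0}^{b-1} \left(\frac{q}{p}\right)^{i} s(1).
\end{aligned}$$

This gives the result. ■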




We are now able to finish the proof of Theorem 11.1.2. 

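If $p = 1/2$, then $q/p = 1$, so (11.7.2) becomes $s(b) = b\, s(1)$. But $s(c) = 1$, so we
must have $c\, s(1) = 1$, i.e., $s(1) = 1/c$. Then $s(b) = b\, s(1) = b/c$. Hence, $s(a) = a/c$
in this case.

If $p \ne 1/2$, then $q/p \ne 1$, so (11.7.2) is a geometric series, and becomes

$$s(b) = \frac{(q/p)^b - 1}{(q/p) - 1}\, s(1).$$

Because $s(c) = 1$, we must have

$$1 = \frac{(q/p)^c - 1}{(q/p) - 1}\, s(1), \quad \text{so} \quad s(1) = \frac{(q/p) - 1}{(q/p)^c - 1}.$$

Then

$$s(b) = \frac{(q/p)^b - 1}{(q/p) - 1}\, s(1) = \frac{(q/p)^b - 1}{(q/p) - 1} \cdot \frac{(q/p) - 1}{(q/p)^c - 1} = \frac{(q/p)^b - 1}{(q/p)^c - 1}.$$

Hence,

$$s(a) = \frac{(q/p)^a - 1}{(q/p)^c - 1}$$

in this case. ■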
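
The chain of lemmas amounts to a small algorithm: build the partial sums of (11.7.2), then scale them so that $s(c) = 1$. The sketch below, with hypothetical function names, does this and compares the result with the closed form; the values $c = 10$ and $p = 0.6$ are arbitrary.

```python
def win_prob_closed(a, c, p):
    """Closed form for s(a) = P(hit c before 0), from Theorem 11.1.2."""
    if p == 0.5:
        return a / c
    r = (1 - p) / p                      # r = q/p
    return (r ** a - 1) / (r ** c - 1)

def win_prob_table(c, p):
    """All of s(0), ..., s(c), built from Lemma 11.7.3 and the condition s(c) = 1."""
    r = (1 - p) / p
    partial = [0.0]                      # partial[b] = sum of r**i for i < b
    for i in range(c):
        partial.append(partial[-1] + r ** i)
    s1 = 1.0 / partial[c]                # choose s(1) so that s(c) = 1
    return [s1 * x for x in partial]

table = win_prob_table(10, 0.6)
print(table[3], win_prob_closed(3, 10, 0.6))
```

Each interior entry of `table` also satisfies the first-step equation (11.7.1), which gives an independent check on the algebra.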


Proof of Theorem 11.1.3 


We want to prove that when $\{X_n\}$ is a simple random walk, with initial fortune $a > 0$
and probability $p$ of winning each bet, then the probability $P(\tau_0 < \infty)$ that the walk
will ever hit 0 is given by

$$P(\tau_0 < \infty) = \begin{cases} 1 & p \le 1/2 \\ (q/p)^a & p > 1/2. \end{cases}$$

By continuity of probabilities, we see that

$$P(\tau_0 < \infty) = \lim_{c \to \infty} P(\tau_0 < \tau_c) = \lim_{c \to \infty} \big(1 - P(\tau_c < \tau_0)\big).$$

Hence, if $p = 1/2$, then $P(\tau_0 < \infty) = \lim_{c \to \infty} (1 - a/c) = 1$.
Now, if $p \ne 1/2$, then

$$P(\tau_0 < \infty) = \lim_{c \to \infty} \left(1 - \frac{1 - (q/p)^a}{1 - (q/p)^c}\right).$$

If $p < 1/2$, then $q/p > 1$, so $\lim_{c \to \infty} (q/p)^c = \infty$, and $P(\tau_0 < \infty) = 1$. If $p > 1/2$,
then $q/p < 1$, so $\lim_{c \to \infty} (q/p)^c = 0$, and $P(\tau_0 < \infty) = (q/p)^a$. ■




Proof of Theorem 11.3.3 


We want to prove that the Metropolis-Hastings algorithm results in a Markov chain
$X_0, X_1, X_2, \ldots$, which has $\{\pi_i\}$ as a stationary distribution.


We shall prove that the resulting Markov chain is reversible with respect to $\{\pi_i\}$,
i.e., that

$$\pi_i\, P(X_{n+1} = j \mid X_n = i) = \pi_j\, P(X_{n+1} = i \mid X_n = j), \qquad (11.7.3)$$

for $i, j \in S$. It will then follow from Theorem 11.2.6 that $\{\pi_i\}$ is a stationary
distribution for the chain.

We thus have to prove (11.7.3). Now, (11.7.3) is clearly true if $i = j$, so we can
assume that $i \ne j$.

But if $i \ne j$, and $X_n = i$, then the only way we can have $X_{n+1} = j$ is if $Y_{n+1} = j$
(i.e., we propose the state $j$, which we will do with probability $q_{ij}$) and we accept
this proposal (which we will do with probability $\alpha_{ij}$). Hence,

$$P(X_{n+1} = j \mid X_n = i) = q_{ij}\, \alpha_{ij} = q_{ij} \min\left\{1, \frac{\pi_j q_{ji}}{\pi_i q_{ij}}\right\} = \min\left\{q_{ij}, \frac{\pi_j q_{ji}}{\pi_i}\right\}.$$

It follows that $\pi_i\, P(X_{n+1} = j \mid X_n = i) = \min\{\pi_i q_{ij}, \pi_j q_{ji}\}$.
Similarly, we compute that $\pi_j\, P(X_{n+1} = i \mid X_n = j) = \min\{\pi_j q_{ji}, \pi_i q_{ij}\}$. It
follows that (11.7.3) is true. ■
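
The reversibility argument can be checked on a small finite state space by building the transition probabilities $q_{ij}\alpha_{ij}$ explicitly. In the sketch below, the target distribution `pi` and the proposal matrix `q` are made-up illustrative numbers, and `mh_transition_matrix` is a hypothetical helper; it assumes every $\pi_i > 0$.

```python
def mh_transition_matrix(pi, q):
    """Transition probabilities of the Metropolis-Hastings chain on states 0..n-1."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and q[i][j] > 0:
                accept = min(1.0, (pi[j] * q[j][i]) / (pi[i] * q[i][j]))
                P[i][j] = q[i][j] * accept      # propose j, then accept
        P[i][i] = 1.0 - sum(P[i])               # rejected proposals stay at i
    return P

pi = [0.2, 0.5, 0.3]                            # hypothetical target distribution
q = [[0.0, 0.7, 0.3],                           # hypothetical proposal probabilities q_ij
     [0.4, 0.0, 0.6],
     [0.5, 0.5, 0.0]]
P = mh_transition_matrix(pi, q)

# Reversibility (11.7.3): pi_i P_ij should equal pi_j P_ji for every pair.
for i in range(3):
    for j in range(3):
        print(i, j, pi[i] * P[i][j], pi[j] * P[j][i])
```

Summing the reversibility identity over $i$ then shows that `pi` is stationary for `P`, which is the conclusion drawn from Theorem 11.2.6.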


Proof of Theorem 11.5.1 


We want to prove that when $\{Y^{(M)}_t : t \ge 0\}$ is as defined earlier, then for large $M$:


(a) For $t \ge 0$, the distribution of $Y^{(M)}_t$ is approximately $N(0, t)$, i.e., normally
distributed with mean 0 and variance $t$.
(b) For $s, t \ge 0$, the covariance

$$\mathrm{Cov}\big(Y^{(M)}_s, Y^{(M)}_t\big)$$

is approximately equal to $\min\{s, t\}$.
(c) For $t \ge s \ge 0$, the distribution of the increment

$$Y^{(M)}_t - Y^{(M)}_s$$

is approximately $N(0, t - s)$, i.e., normally distributed with mean 0 and variance $t - s$,
and is approximately independent of $Y^{(M)}_s$.
(d) $Y^{(M)}_t$ is a continuous function of $t$.

Write $\lfloor r \rfloor$ for the greatest integer not exceeding $r$, so that, e.g., $\lfloor 7.6 \rfloor = 7$. Then we
see that for large $M$, $t$ is very close to $\lfloor tM \rfloor / M$, so that $Y^{(M)}_t$ is very close (formally,
within $O(1/M)$ in probability) to

$$A \equiv Y^{(M)}_{\lfloor tM \rfloor / M} = \frac{1}{\sqrt{M}}\,(Z_1 + Z_2 + \cdots + Z_{\lfloor tM \rfloor}).$$



Now, $A$ is equal to $1/\sqrt{M}$ times the sum of $\lfloor tM \rfloor$ different i.i.d. random variables,
each having mean 0 and variance 1. It follows from the central limit theorem that $A$
converges in distribution to the distribution $N(0, t)$ as $M \to \infty$. This proves part (a).

For part (b), note that also $Y^{(M)}_s$ is very close to

$$B \equiv Y^{(M)}_{\lfloor sM \rfloor / M} = \frac{1}{\sqrt{M}}\,(Z_1 + Z_2 + \cdots + Z_{\lfloor sM \rfloor}).$$

Because $E(Z_i) = 0$, we must have $E(A) = E(B) = 0$, so that $\mathrm{Cov}(A, B) = E(AB)$.

For simplicity, assume $s \le t$; the case $s > t$ is similar. Then we have

$$\begin{aligned}
\mathrm{Cov}(A, B) = E(AB)
&= \frac{1}{M}\, E\big((Z_1 + Z_2 + \cdots + Z_{\lfloor sM \rfloor})(Z_1 + Z_2 + \cdots + Z_{\lfloor tM \rfloor})\big) \\
&= \frac{1}{M}\, E\left(\sum_{i=1}^{\lfloor sM \rfloor} \sum_{j=1}^{\lfloor tM \rfloor} Z_i Z_j\right)
= \frac{1}{M} \sum_{i=1}^{\lfloor sM \rfloor} \sum_{j=1}^{\lfloor tM \rfloor} E(Z_i Z_j).
\end{aligned}$$

Now, we have $E(Z_i Z_j) = 0$ unless $i = j$, in which case $E(Z_i Z_j) = 1$. There will
be precisely $\lfloor sM \rfloor$ terms in the sum for which $i = j$, namely, one for each value of $i$
(since $t \ge s$). Hence,

$$\mathrm{Cov}(A, B) = \frac{\lfloor sM \rfloor}{M},$$

which converges to $s$ as $M \to \infty$. This proves part (b).
Part (c) follows very similarly to part (a). Finally, part (d) follows because the
function $Y^{(M)}_t$ was constructed in a continuous manner (as in Figure 11.5.1). ■


Proof of Theorem 11.6.1 


We want to prove that for any $t > 0$, the distribution of $N_t$ is Poisson($at$).
We first require a technical lemma.


Lemma 11.7.4 Let $g_n(t) = e^{-at} a^n t^{n-1}/(n - 1)!$ be the density of the
Gamma($n, a$) distribution. Then for $n \ge 1$,

$$\int_0^t g_n(s)\, ds = \sum_{i=n}^{\infty} e^{-at} (at)^i / i!. \qquad (11.7.4)$$


PROOF | If $t = 0$, then both sides are 0. For other $t$, differentiating with respect to
$t$, we see (setting $j = i - 1$) that

$$\begin{aligned}
\frac{d}{dt} \sum_{i=n}^{\infty} e^{-at} (at)^i / i!
&= \sum_{i=n}^{\infty} \left(-a e^{-at} (at)^i / i! + e^{-at} a^i t^{i-1} / (i - 1)!\right) \\
&= \sum_{i=n}^{\infty} \left(-e^{-at} a^{i+1} t^i / i!\right) + \sum_{j=n-1}^{\infty} e^{-at} a^{j+1} t^j / j! \\
&= e^{-at} a^{(n-1)+1} t^{n-1} / (n - 1)! = g_n(t) = \frac{d}{dt} \int_0^t g_n(s)\, ds.
\end{aligned}$$

Because this is true for all $t \ge 0$, we see that (11.7.4) is satisfied for any $n \ge 1$. ■
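
Identity (11.7.4) can also be checked numerically for particular values. The sketch below approximates the left side with composite Simpson's rule and computes the right side as one minus the finite sum over $i < n$. The function names, the step count, and the test values $n = 3$, $a = 2$, $t = 1.5$ are arbitrary assumptions.

```python
import math

def gamma_density(n, a, s):
    """g_n(s) = e^{-a s} a^n s^{n-1} / (n-1)!, the Gamma(n, a) density."""
    return math.exp(-a * s) * a ** n * s ** (n - 1) / math.factorial(n - 1)

def integral_lhs(n, a, t, steps=2000):
    """Composite Simpson approximation of the integral of g_n over [0, t] (steps even)."""
    h = t / steps
    total = gamma_density(n, a, 0.0) + gamma_density(n, a, t)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * gamma_density(n, a, k * h)
    return total * h / 3.0

def poisson_tail_rhs(n, a, t):
    """Sum over i >= n of e^{-a t} (a t)^i / i!, computed as 1 minus the finite head."""
    head = sum(math.exp(-a * t) * (a * t) ** i / math.factorial(i) for i in range(n))
    return 1.0 - head

print(integral_lhs(3, 2.0, 1.5), poisson_tail_rhs(3, 2.0, 1.5))
```

The two printed numbers should agree to many decimal places, reflecting that $P(T_n \le t)$ and $P(N_t \ge n)$ are the same probability.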

Recall (see Example 2.4.16) that the Exponential($\lambda$) distribution is the same as the
Gamma($1, \lambda$) distribution. Furthermore, (see Problem 2.9.15) if $X \sim$ Gamma($\alpha_1, \lambda$)
and $Y \sim$ Gamma($\alpha_2, \lambda$) are independent, then $X + Y \sim$ Gamma($\alpha_1 + \alpha_2, \lambda$).




Now, in our case, we have $T_n = R_1 + R_2 + \cdots + R_n$, where $R_i \sim$ Exponential($a$) =
Gamma($1, a$). It follows that $T_n \sim$ Gamma($n, a$). Hence, the density of $T_n$ is $g_n(t) =
e^{-at} a^n t^{n-1}/(n - 1)!$.

Now, the event that $N_t \ge n$ (i.e., that the number of events by time $t$ is at least $n$) is
the same as the event that $T_n \le t$ (i.e., that the $n$th event occurs by time $t$). Hence,

$$P(N_t \ge n) = P(T_n \le t) = \int_0^t g_n(s)\, ds.$$

Then by Lemma 11.7.4,

$$P(N_t \ge n) = \sum_{i=n}^{\infty} e^{-at} (at)^i / i! \qquad (11.7.5)$$

for any $n \ge 1$. If $n = 0$, then both sides are 1, so in fact (11.7.5) holds for any $n \ge 0$.
Using this, we see that

$$\begin{aligned}
P(N_t = j) &= P(N_t \ge j) - P(N_t \ge j + 1) \\
&= \left(\sum_{i=j}^{\infty} e^{-at} (at)^i / i!\right) - \left(\sum_{i=j+1}^{\infty} e^{-at} (at)^i / i!\right) = e^{-at} (at)^j / j!.
\end{aligned}$$

It follows that $N_t \sim$ Poisson($at$), as claimed. ■


Proof of Theorem 11.6.2 


We want to prove that when Q = tọ < ti <b <t3 <--- < tq, thenfori =1,2,...,d, 
the distribution of N, — N,_, is Poisson(a(t; — t;-1)). Furthermore, the random 
variables N, — N;_,, fori =1,...,d, are independent. 


From the memoryless property of the exponential distributions (see Problem 2.4.14),
it follows that regardless of the values of $N_s$ for $s \le t_{i-1}$, this will have no effect on
the distribution of the increments $N_t - N_{t_{i-1}}$ for $t > t_{i-1}$. That is, the process $\{N_t\}$
starts fresh at each time $t_{i-1}$, except from a different initial value $N_{t_{i-1}}$ instead of from
$N_0 = 0$.

Hence, the distribution of $N_{t_{i-1}+u} - N_{t_{i-1}}$ for $u \ge 0$ is identical to the distribution
of $N_u - N_0 = N_u$ and is independent of the values of $N_s$ for $s \le t_{i-1}$. Because we
already know that $N_u \sim$ Poisson$(au)$, it follows that $N_{t_{i-1}+u} - N_{t_{i-1}} \sim$ Poisson$(au)$
as well. In particular, $N_{t_i} - N_{t_{i-1}} \sim$ Poisson$(a(t_i - t_{i-1}))$, with $N_{t_i} - N_{t_{i-1}}$
independent of $\{N_s : s \le t_{i-1}\}$. The result follows. ∎


Appendix A 
Mathematical Background 


To understand this book, it is necessary to know certain mathematical subjects listed 
below. Because it is assumed the student has already taken a course in calculus, topics 
such as derivatives, integrals, and infinite series are treated quite briefly here. Multi- 
variable integrals are treated in somewhat more detail. 


A.1 Derivatives


From calculus, we know that the derivative of a function f is its instantaneous rate of 
change: 


$$f'(x) = \frac{d}{dx} f(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}.$$


In particular, the reader should recall from calculus that 


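$$\frac{d}{dx} 5 = 0, \qquad \frac{d}{dx} x^3 = 3x^2, \qquad \frac{d}{dx} x^n = n x^{n-1},$$
$$\frac{d}{dx} e^x = e^x, \qquad \frac{d}{dx} \sin x = \cos x, \qquad \frac{d}{dx} \cos x = -\sin x,$$

etc. Hence, if $f(x) = x^3$, then $f'(x) = 3x^2$ and, e.g., $f'(7) = 3 \cdot 7^2 = 147$.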
Derivatives respect addition and scalar multiplication, so if f and g are functions 
and C is a constant, then 


$$\frac{d}{dx}\left(C f(x) + g(x)\right) = C f'(x) + g'(x).$$
Thus,
$$\frac{d}{dx}\left(5x^3 - 3x^2 + 7x + 12\right) = 15x^2 - 6x + 7,$$
etc. 
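As a quick check of the polynomial example above, the derivative can be verified in R both symbolically and numerically. This is only a sketch: D() is the base R symbolic differentiation function, and the evaluation point x = 2 and step size h are arbitrary choices made for illustration.

```r
# f(x) = 5x^3 - 3x^2 + 7x + 12, the polynomial from the example above
f <- function(x) 5 * x^3 - 3 * x^2 + 7 * x + 12

# Symbolic derivative with respect to x
fprime_expr <- D(expression(5 * x^3 - 3 * x^2 + 7 * x + 12), "x")
print(fprime_expr)

# Compare three ways of getting f'(2)
x <- 2
exact <- 15 * x^2 - 6 * x + 7          # formula from the text
symbolic <- eval(fprime_expr)          # evaluates the expression at x = 2

# Difference quotient (f(x + h) - f(x)) / h with a small h
h <- 1e-6
approx_val <- (f(x + h) - f(x)) / h

print(c(exact, symbolic, approx_val))
```

The difference quotient agrees with the exact derivative only approximately, since h is small but not zero.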




Finally, derivatives satisfy a chain rule; if a function can be written as a composition 
of two other functions, as in f(x) = g(h(x)), then f’(x) = g’(h(x)) h'(x). Thus, 


$$\frac{d}{dx} e^{5x} = 5e^{5x},$$
$$\frac{d}{dx} \sin(x^2) = 2x \cos(x^2),$$
$$\frac{d}{dx} (x^2 + x^3)^5 = 5(x^2 + x^3)^4 (2x + 3x^2),$$


etc. 
Higher-order derivatives are defined by 


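$$f''(x) = \frac{d}{dx} f'(x), \qquad f'''(x) = \frac{d}{dx} f''(x),$$

etc. In general, the $r$th-order derivative $f^{(r)}(x)$ can be defined inductively by $f^{(0)}(x) = f(x)$ and

$$f^{(r)}(x) = \frac{d}{dx} f^{(r-1)}(x)$$

for $r \ge 1$. Thus, if $f(x) = x^4$, then $f'(x) = 4x^3$, $f''(x) = f^{(2)}(x) = 12x^2$, $f^{(3)}(x) = 24x$,
$f^{(4)}(x) = 24$, etc.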
Derivatives are used often in this text. 


A.2 Integrals


If $f$ is a function, and $a < b$ are constants, then the integral of $f$ over the interval
$[a, b]$, written

$$\int_a^b f(x)\,dx,$$

represents adding up the values $f(x)$, multiplied by the widths of small intervals around
$x$. That is, $\int_a^b f(x)\,dx \approx \sum_{i=1}^{d} f(x_i)(x_i - x_{i-1})$, where $a = x_0 < x_1 < \cdots < x_d = b$
and where $x_i - x_{i-1}$ is small.

More formally, we can set $x_i = a + (i/d)(b - a)$, so that $x_i - x_{i-1} = (b - a)/d$, and let
$d \to \infty$, to get a formal definition of integral as

$$\int_a^b f(x)\,dx = \lim_{d \to \infty} \sum_{i=1}^{d} f\left(a + (i/d)(b - a)\right) \frac{b - a}{d}.$$


To compute $\int_a^b f(x)\,dx$ in this manner each time would be tedious. Fortunately, the
fundamental theorem of calculus provides a much easier way to compute integrals. It
says that if $F(x)$ is any function with $F'(x) = f(x)$, then $\int_a^b f(x)\,dx = F(b) - F(a)$.
Hence,

$$\int_a^b 3x^2\,dx = b^3 - a^3,$$
$$\int_a^b x^2\,dx = \frac{1}{3}\left(b^3 - a^3\right),$$
$$\int_a^b x^n\,dx = \frac{1}{n+1}\left(b^{n+1} - a^{n+1}\right),$$

and

$$\int_a^b \cos x\,dx = \sin b - \sin a,$$
$$\int_a^b \sin x\,dx = -(\cos b - \cos a),$$
$$\int_a^b e^{5x}\,dx = \frac{1}{5}\left(e^{5b} - e^{5a}\right).$$
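The relationship between the Riemann sum definition and the fundamental theorem of calculus can be illustrated numerically. The sketch below uses a hypothetical helper, riemann_sum, written for this illustration (it is not a built-in R function), with equally spaced points $x_i = a + (i/d)(b - a)$; the endpoints a = 1 and b = 2 are arbitrary.

```r
# Riemann sum with d equal-width subintervals and right endpoints
# x_i = a + (i/d)(b - a); riemann_sum is an illustrative helper.
riemann_sum <- function(f, a, b, d) {
  x <- a + (1:d) / d * (b - a)     # the points x_1, ..., x_d
  sum(f(x)) * (b - a) / d          # each term is weighted by the width (b - a)/d
}

a <- 1
b <- 2
approx_val <- riemann_sum(function(x) 3 * x^2, a, b, 100000)

# Fundamental theorem of calculus: the integral of 3x^2 over [a, b] is b^3 - a^3
exact_val <- b^3 - a^3

print(c(approx_val, exact_val))
```

Increasing d brings the Riemann sum closer to the exact value, at the cost of more computation.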


A.3 Infinite Series


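If $a_1, a_2, a_3, \ldots$ is an infinite sequence of numbers, we can consider the infinite sum
(or series)

$$\sum_{i=1}^{\infty} a_i = a_1 + a_2 + a_3 + \cdots.$$

Formally, $\sum_{i=1}^{\infty} a_i = \lim_{N \to \infty} \sum_{i=1}^{N} a_i$. This sum may be finite or infinite.

For example, clearly $\sum_{i=1}^{\infty} 1 = 1 + 1 + 1 + 1 + \cdots = \infty$. On the other hand,
because

$$\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \cdots + \frac{1}{2^n} = \frac{2^n - 1}{2^n},$$

we see that

$$\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \cdots = \sum_{i=1}^{\infty} \frac{1}{2^i} = \lim_{n \to \infty} \frac{2^n - 1}{2^n} = 1.$$

More generally, we compute that

$$\sum_{i=1}^{\infty} a^i = \frac{a}{1 - a}$$

whenever $|a| < 1$.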
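Partial sums of a geometric series are easy to compute in R, which gives a quick numerical check of the limits discussed above. This sketch relies on the standard geometric series identity $\sum_{i=1}^{\infty} a^i = a/(1-a)$ for $|a| < 1$; the value a = 0.3 and the number of terms are arbitrary choices.

```r
# Partial sum 1/2 + 1/4 + ... + 1/2^n compared with (2^n - 1)/2^n
n <- 10
half_sum <- sum((1 / 2)^(1:n))
half_formula <- (2^n - 1) / 2^n

# Partial sum of a + a^2 + ... + a^N compared with the limit a/(1 - a)
a <- 0.3
N <- 50
partial <- sum(a^(1:N))
limit <- a / (1 - a)

print(c(half_sum, half_formula))
print(c(partial, limit))
```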

One particularly important kind of infinite series is a Taylor series. If $f$ is a function,
then its Taylor series is given by

$$f(0) + x f'(0) + \frac{1}{2!} x^2 f''(0) + \frac{1}{3!} x^3 f'''(0) + \cdots = \sum_{i=0}^{\infty} \frac{1}{i!} x^i f^{(i)}(0).$$

(Here $i! = i(i-1)(i-2) \cdots (2)(1)$ stands for $i$ factorial, with $0! = 1! = 1$, $2! = 2$,
$3! = 6$, $4! = 24$, etc.) Usually, $f(x)$ will be exactly equal to its Taylor series expansion;
thus,

$$\sin x = x - x^3/3! + x^5/5! - x^7/7! + \cdots,$$
$$\cos x = 1 - x^2/2! + x^4/4! - x^6/6! + \cdots,$$
$$e^x = 1 + x + x^2/2! + x^3/3! + x^4/4! + \cdots,$$
$$e^{5x} = 1 + 5x + (5x)^2/2! + (5x)^3/3! + (5x)^4/4! + \cdots,$$

etc. If $f(x)$ is a polynomial (e.g., $f(x) = x^3 - 3x^2 + 2x - 6$), then the Taylor series
of $f(x)$ is precisely the same function as $f(x)$ itself.




A.4 Matrix Multiplication


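A matrix is any $r \times s$ collection of numbers, e.g.,

$$A = \begin{pmatrix} 8 & 6 \\ 5 & 2 \end{pmatrix}, \qquad B = \begin{pmatrix} 3 & 6 & 2 \\ -7 & 6 & 0 \end{pmatrix}, \qquad C = \begin{pmatrix} 3 & -1 \\ 3/5 & 2/5 \\ -0.6 & -17.9 \end{pmatrix},$$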


etc. 

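Matrices can be multiplied, as follows. If $A$ is an $r \times s$ matrix, and $B$ is an $s \times u$ matrix,
then the product $AB$ is an $r \times u$ matrix whose $i, j$ entry is given by $\sum_{k=1}^{s} A_{ik} B_{kj}$,
a sum of products. For example, with $A$ and $B$ as above, if $M = AB$, then

$$M = \begin{pmatrix} 8 & 6 \\ 5 & 2 \end{pmatrix} \begin{pmatrix} 3 & 6 & 2 \\ -7 & 6 & 0 \end{pmatrix} = \begin{pmatrix} 8(3) + 6(-7) & 8(6) + 6(6) & 8(2) + 6(0) \\ 5(3) + 2(-7) & 5(6) + 2(6) & 5(2) + 2(0) \end{pmatrix} = \begin{pmatrix} -18 & 84 & 16 \\ 1 & 42 & 10 \end{pmatrix},$$

as, for example, the $(2, 1)$ entry of $M$ equals $5(3) + 2(-7) = 1$.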
Matrix multiplication turns out to be surprisingly useful, and it is used in various 
places in this book. 
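The same product can be computed in R with the matrix multiplication operator %*%. This is a small sketch in which the entries of A and B are read off from the worked product above; matrix() fills by column unless byrow = TRUE is given.

```r
# A is 2 x 2 and B is 2 x 3, so AB is 2 x 3
A <- matrix(c(8, 6,
              5, 2), nrow = 2, byrow = TRUE)
B <- matrix(c( 3, 6, 2,
              -7, 6, 0), nrow = 2, byrow = TRUE)

M <- A %*% B     # matrix product (note that A * B would be componentwise)
print(M)

# The (2, 1) entry as an explicit sum of products over k
m21 <- sum(A[2, ] * B[, 1])
print(m21)
```

Note that B %*% A is not defined here, because the number of columns of B (3) differs from the number of rows of A (2).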


A.5 Partial Derivatives


Suppose $f$ is a function of two variables, as in $f(x, y) = 3x^2y^3$. Then we can take a
partial derivative of $f$ with respect to $x$, writing

$$\frac{\partial}{\partial x} f(x, y),$$

by varying $x$ while keeping $y$ fixed. That is,

$$\frac{\partial}{\partial x} f(x, y) = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}.$$

This can be computed simply by regarding $y$ as a constant value. For the example
above,

$$\frac{\partial}{\partial x}\left(3x^2y^3\right) = 6xy^3.$$

Similarly, by regarding $x$ as constant and varying $y$, we see that

$$\frac{\partial}{\partial y}\left(3x^2y^3\right) = 9x^2y^2.$$

Other examples include

$$\frac{\partial}{\partial x}\left(18e^{xy} + x^6y^8 - \sin(y^3)\right) = 18ye^{xy} + 6x^5y^8,$$
$$\frac{\partial}{\partial y}\left(18e^{xy} + x^6y^8 - \sin(y^3)\right) = 18xe^{xy} + 8x^6y^7 - 3y^2\cos(y^3),$$




etc. 
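If $f$ is a function of three or more variables, then partial derivatives may similarly
be taken. Thus,

$$\frac{\partial}{\partial x}\left(x^2y^4z^6\right) = 2xy^4z^6, \qquad \frac{\partial}{\partial y}\left(x^2y^4z^6\right) = 4x^2y^3z^6, \qquad \frac{\partial}{\partial z}\left(x^2y^4z^6\right) = 6x^2y^4z^5,$$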


etc. 
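Partial derivatives such as these can be checked in R with the base function D(), which differentiates an expression with respect to one named variable while treating the others as constants. The sketch below uses $f(x, y) = 3x^2y^3$ from this section; the evaluation point (2, 3) is an arbitrary choice.

```r
# f(x, y) = 3 x^2 y^3
f_expr <- expression(3 * x^2 * y^3)

df_dx <- D(f_expr, "x")   # y is treated as a constant
df_dy <- D(f_expr, "y")   # x is treated as a constant
print(df_dx)
print(df_dy)

# Evaluate at the point (x, y) = (2, 3) and compare with 6xy^3 and 9x^2y^2
x <- 2
y <- 3
val_dx <- eval(df_dx)
val_dy <- eval(df_dy)
print(c(val_dx, 6 * x * y^3))
print(c(val_dy, 9 * x^2 * y^2))
```

D() returns an unsimplified expression, so its printed form may look different from the hand-computed formula even though the two agree when evaluated.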


A.6 Multivariable Integrals


If f is a function of two or more variables, we can still compute integrals of f. How- 
ever, instead of taking integrals over an interval [a,b], we must take integrals over 
higher-dimensional regions. 

For example, let $f(x, y) = x^2y^3$, and let $R$ be the rectangular region given by
$R = \{0 \le x \le 1,\ 5 \le y \le 7\} = [0, 1] \times [5, 7]$. What is

$$\int\!\!\int_R f(x, y)\,dx\,dy,$$

the integral of $f$ over the region $R$? In geometrical terms, it is the volume under the
graph of $f$ (and this is a surface) over the region $R$. But how do we compute this?
Well, if $y$ is constant, we know that

$$\int_0^1 f(x, y)\,dx = \int_0^1 x^2y^3\,dx = \frac{1}{3}y^3. \qquad (A.6.1)$$

This corresponds to adding up the values of $f$ along one “strip” of the region $R$, where
$y$ is constant. In Figure A.6.1, we show the region of integration $R = [0, 1] \times [5, 7]$.
The value of (A.6.1), when $y = 6.2$, is $(6.2)^3/3 = 79.443$; this is the area under the
curve $x^2(6.2)^3$ over the line $[0, 1] \times \{6.2\}$.




Figure A.6.1: Plot of the region of integration (shaded) R = [0, 1] x [5, 7] together with the 
line at y = 6.2. 




If we then add up the values of the areas over these strips along all different possible 
y values, then we obtain the overall integral or volume, as follows: 


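$$\int\!\!\int_R f(x, y)\,dx\,dy = \int_5^7 \left(\int_0^1 f(x, y)\,dx\right) dy = \int_5^7 \left(\int_0^1 x^2y^3\,dx\right) dy$$
$$= \int_5^7 \frac{1}{3}y^3\,dy = \frac{1}{3} \cdot \frac{1}{4}\left(7^4 - 5^4\right) = 148.$$

So the volume under the graph of $f$ and over the region $R$ is given by 148.
Note that we can also compute this integral by integrating first $y$ and then $x$, and
we get the same answer:

$$\int\!\!\int_R f(x, y)\,dx\,dy = \int_0^1 \left(\int_5^7 f(x, y)\,dy\right) dx = \int_0^1 \left(\int_5^7 x^2y^3\,dy\right) dx$$
$$= \int_0^1 \frac{1}{4}x^2\left(7^4 - 5^4\right) dx = \frac{1}{3} \cdot \frac{1}{4}\left(7^4 - 5^4\right) = 148.$$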


Nonrectangular Regions 


If the region R is not a rectangle, then the computation is more complicated. The idea 
is that, for each value of x, we integrate y over only those values for which the point 
(x, y) is inside R. 

For example, suppose that $R$ is the triangle given by $R = \{(x, y) : 0 \le 2y \le x \le 6\}$.
In Figure A.6.2, we have plotted this region together with the slices at $x = 3$
and y = 3/2. We use the x-slices to determine the limits on y for fixed x when we 
integrate out y first; we use the y-slices to determine the limits on x for fixed y when 
we integrate out x first. 


Figure A.6.2: The integration region (shaded) $R = \{(x, y) : 0 \le 2y \le x \le 6\}$ together with
the slices at $x = 3$ and $y = 3/2$.




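Then $x$ can take any value between 0 and 6. However, once we know $x$, then $y$ can
only take values between 0 and $x/2$. Hence, if $f(x, y) = xy + x^6y^8$, then

$$\int\!\!\int_R f(x, y)\,dx\,dy = \int_0^6 \left(\int_0^{x/2} f(x, y)\,dy\right) dx = \int_0^6 \left(\int_0^{x/2} \left(xy + x^6y^8\right) dy\right) dx$$
$$= \int_0^6 \left(x \cdot \frac{1}{2}\left((x/2)^2 - 0^2\right) + x^6 \cdot \frac{1}{9}\left((x/2)^9 - 0^9\right)\right) dx$$
$$= \int_0^6 \left(\frac{1}{8}x^3 + \frac{1}{4608}x^{15}\right) dx$$
$$= \frac{1}{8} \cdot \frac{1}{4}\left(6^4 - 0^4\right) + \frac{1}{4608} \cdot \frac{1}{16}\left(6^{16} - 0^{16}\right)$$
$$= 3.8264 \times 10^7.$$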


Once again, we can compute the same integral in the opposite order, by integrating
first $x$ and then $y$. In this case, $y$ can take any value between 0 and 3. Then, for a given
value of $y$, we see that $x$ can take values between $2y$ and 6. Hence,

$$\int\!\!\int_R f(x, y)\,dx\,dy = \int_0^3 \left(\int_{2y}^6 f(x, y)\,dx\right) dy = \int_0^3 \left(\int_{2y}^6 \left(xy + x^6y^8\right) dx\right) dy.$$


We leave it as an exercise for the reader to finish this integral, and see that the same 
answer as above is obtained. 
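Both orders of integration over the triangle $0 \le 2y \le x \le 6$ can be checked numerically in R by nesting calls to the base function integrate(). This is a sketch rather than a general-purpose routine: the helper names inner_y and inner_x are introduced here for illustration, and sapply() is needed because integrate() expects a function that accepts a vector of points.

```r
f <- function(x, y) x * y + x^6 * y^8

# Integrate out y first: for each x in [0, 6], y runs over [0, x/2]
inner_y <- function(x) {
  sapply(x, function(xi) integrate(function(y) f(xi, y), 0, xi / 2)$value)
}
order_yx <- integrate(inner_y, 0, 6)$value

# Integrate out x first: for each y in [0, 3], x runs over [2y, 6]
inner_x <- function(y) {
  sapply(y, function(yi) integrate(function(x) f(x, yi), 2 * yi, 6)$value)
}
order_xy <- integrate(inner_x, 0, 3)$value

# Closed form obtained by integrating y first
exact <- 6^4 / 32 + 6^16 / (4608 * 16)

print(c(order_yx, order_xy, exact))
```

All three values should be close to $3.8264 \times 10^7$; the numerical results carry a small quadrature error.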

Functions of three or more variables can also be integrated over regions of the 
corresponding dimension three or higher. For simplicity, we do not emphasize such 
higher-order integrals in this book. 


Appendix B 
Computations 


We briefly describe two computer packages that can be used for all the computations 
carried out in the text. We recommend that students familiarize themselves with at 
least one of these. The description of R is quite complete, at least for the computations 
based on material in this text, whereas another reference is required to learn Minitab. 


B.1 Using R


R is a free statistical software package that can be downloaded and installed on your 
computer (see http://cran.r-project.org/). A free manual is also available at this site. 

Once you have R installed on your system, you can invoke it by clicking on the 
relevant icon (or, on Unix systems, simply typing “R”). You then see a window, called 
the R Console, that contains some text and a prompt '>' after which you type
commands. Commands are separated by new lines or ';'. Output from commands is also
displayed in this window, unless it is purposefully directed elsewhere. To quit R, type 
q() after the prompt. To learn about anything in R, a convenient resource is to use 
Help on the menu bar available at the top of the R window. Alternatively, type ?name 
after the prompt (and press enter) to display information about name, e.g., ?q brings 
up a page with information about the terminate command q. 


Basic Operations and Functions 
A basic command evaluates an expression, such as 


> 2+3
[1] 5


which adds 2 and 3 and produces the answer 5. Alternatively, we could assign the value 
of the expression to a variable such as 


> a <- 2 


where <- (less than followed by minus) assigns the value 2 to a variable called a. 
Alternatively, = can be used for assignment as in a = 2, but we will use <-. We 


683 


684 Appendix B: Computations 


can then verify this assignment by simply typing a and hitting return, which causes the 
value of a to be printed. 


> a
[1] 2


Note that R is case sensitive, so A would be a different variable than a. There are some 
restrictions in choosing names for variables and vectors, but you won’t go wrong if you 
always start the name with a letter. 

We can assign the values in a vector using the concatenate function c() such as


> b <- c(1,1,1,3,4,5)
> b
[1] 1 1 1 3 4 5


which creates a vector called b with six values in it. We can access the ith entry in a
vector b by referring to it as b[i]. For example,


> b[3]
[1] 1


prints the third entry in b, namely, 1. Alternatively, we can use the scan command to 
input data. For example, 


> b <- scan()
1: 1 1 1 3 4 5
7:
Read 6 items
> b
[1] 1 1 1 3 4 5


accomplishes the same assignment. Note that with the scan command, we simply
type in the data and terminate data input by entering a blank line. We can also use
scan to read data in from a file, and we refer the reader to ?scan for this.

Sometimes we want vectors whose entries are in some pattern. We can often use
the rep function for this. For example, x <- rep(1,20) creates a vector of 20
ones. More complicated patterns can be obtained, and we refer the reader to ?rep for
this.
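A few patterned vectors are sketched below. The times and each arguments of rep, and the functions seq and the colon operator, are standard base R; the particular values are arbitrary.

```r
x1 <- rep(1, 20)                    # twenty ones
x2 <- rep(c(1, 2, 3), times = 2)    # the whole vector repeated: 1 2 3 1 2 3
x3 <- rep(c(1, 2, 3), each = 2)     # each entry repeated: 1 1 2 2 3 3
x4 <- seq(0, 1, by = 0.25)          # equally spaced values from 0 to 1
x5 <- 3:7                           # consecutive integers 3, 4, 5, 6, 7

print(x2)
print(x3)
print(x4)
```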

Basic arithmetic can be carried out on variables and vectors using + (addition), - 
(subtraction), * (multiplication), / (division), and ^ (exponentiation). These operations 
are carried out componentwise. For example, we could multiply each component of b 
by itself via 


> b*b
[1] 1 1 1 9 16 25


or multiply each element of b by 2 as in


> 2*b
[1] 2 2 2 6 8 10




There are various functions available in R, such as abs(x) (calculates the absolute
value of x), log(x) (calculates the natural logarithm of x), exp(x) (calculates e
raised to the power x), sin(x), cos(x), tan(x) (which calculate the trigonometric
functions), sqrt(x) (which calculates the square root of x), ceiling(x), and
floor(x) (calculate the ceiling and floor of x). When such a function is applied to
a vector x, it returns a vector of the same length, with the function applied to each
element of the original vector. There are numerous special functions available in R, but
two important ones are gamma(x), which returns the gamma function applied to x,
and lgamma(x), which returns the natural logarithm of the gamma function.

There are also functions that return a single value when applied to a vector. For
example, min(x) and max(x) return, respectively, the smallest and largest elements
in x; length(x) gives the number of elements in x; and sum(x) gives the sum of
the values in x.

R also operates on logical quantities TRUE (or T for true) and FALSE (or F for 
false). Logical values are generated by conditions that are either true or false. For 
example, for a numeric vector a with five entries,


> b <- a>0
> b
[1] FALSE TRUE TRUE FALSE FALSE


compares each element of the vector a with 0, returning TRUE when it is greater than 0 
and FALSE otherwise, and these logical values are stored in the vector b. The follow- 
ing logical operators can be used: <, <=, >=, >, == (for equality), != (for inequality) 
as well as & (for conjunction), | (for disjunction) and ! (for negation). For example, if 
we create a logical vector c as follows: 


> c <- c(T,T,T,T,T)
> b&c
[1] FALSE TRUE TRUE FALSE FALSE
> b|c
[1] TRUE TRUE TRUE TRUE TRUE


then an element of b&c is TRUE when both corresponding elements of b and c are 
TRUE, while an element of b|c is TRUE when at least one of the corresponding ele- 
ments of b and c is TRUE. 
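Logical vectors are often used to count or select entries. The following sketch uses an illustrative vector a; its values are made up for this example and are not taken from the text.

```r
a <- c(-3, 4, 2, -1, -5)        # illustrative data
b <- a > 0                      # FALSE TRUE TRUE FALSE FALSE

n_pos <- sum(b)                 # TRUE counts as 1, so this counts positive entries
pos <- a[b]                     # keeps only the entries where b is TRUE
mid <- a[a > -2 & a < 3]        # entries strictly between -2 and 3
none_zero <- !any(a == 0)       # TRUE when no entry equals 0

print(n_pos)
print(pos)
print(mid)
```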

Sometimes we may have variables that take character values. While it is always 
possible to code these values as numbers, there is no need to do this, as R can also 
handle character-valued variables. For example, the commands 


> A <- c('a','b')
> A
[1] "a" "b"


create a character vector A, containing two values a and b, and then we print out this 
vector. Note that we included the character values in single quotes when doing the 
assignment. 




Sometimes data values are missing and so are listed as NA (not available). Opera- 
tions on missing values create missing values. Also, an impossible operation, such as 
0/0, produces NaN (not a number). 
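The behaviour of missing values can be seen in a short sketch. The functions is.na and is.nan and the na.rm argument are standard base R; the data are made up.

```r
y <- c(2, NA, 5)             # the second value is missing

s1 <- sum(y)                 # NA, because an operation on a missing value is missing
s2 <- sum(y, na.rm = TRUE)   # drops the missing value before summing
miss <- is.na(y)             # logical vector marking the missing entries

z <- 0 / 0                   # an impossible operation gives NaN
print(s1)
print(s2)
print(miss)
print(is.nan(z))
```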

Various objects can be created during an R session. To see those created so far in
your session, use the command ls(). You can remove any objects in your workspace
using the rm command. For example, rm(x) removes the vector x.


Probability Functions 


R has a number of built-in functions for evaluation of the cdf, the inverse cdf, the 
density or probability function, and generating random samples for the common dis- 
tributions we encounter in probability and statistics. These are distinguished by prefix 
and base distribution names. Some of the distribution names are given in the following 
table. 


Distribution    R name                   Distribution        R name
beta            beta(-,a,b)              hypergeometric      hyper(-,N,M,n)
binomial        binom(-,n,p)             negative binomial   nbinom(-,k,p)
chi-squared     chisq(-,df)              normal              norm(-,mu,sigma)
exponential     exp(-,lambda)            Poisson             pois(-,lambda)
F               f(-,df1,df2)             t                   t(-,df)
gamma           gamma(-,alpha,lambda)    uniform             unif(-,min,max)
geometric       geom(-,p)


As usual, one has to be careful with the gamma distribution. The safest path is
to include another argument with the distribution to indicate whether or not lambda
is a rate parameter (density is $(\Gamma(\alpha))^{-1} \lambda^{\alpha} x^{\alpha-1} e^{-\lambda x}$) or a scale parameter (density
is $(\Gamma(\alpha))^{-1} \lambda^{-\alpha} x^{\alpha-1} e^{-x/\lambda}$). So gamma(-,alpha,rate=lambda) indicates that
lambda is a rate parameter, and gamma(-,alpha,scale=lambda) indicates that
it is a scale parameter.
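The difference between the two parameterizations can be checked directly with dgamma, whose argument names are shape, rate, and scale. This is a small sketch; the values of alpha, lambda, and x are arbitrary.

```r
alpha <- 3
lambda <- 2
x <- 1.5

# lambda as a rate parameter
d_rate <- dgamma(x, shape = alpha, rate = lambda)

# The same density written with a scale parameter equal to 1/lambda
d_scale <- dgamma(x, shape = alpha, scale = 1 / lambda)

# The rate form of the density computed by hand
by_hand <- lambda^alpha * x^(alpha - 1) * exp(-lambda * x) / gamma(alpha)

# lambda used as a scale parameter gives a different density
d_other <- dgamma(x, shape = alpha, scale = lambda)

print(c(d_rate, d_scale, by_hand, d_other))
```

The first three values agree, while the fourth differs, which is why naming the argument explicitly is the safest path.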

The argument given by - is specified according to what purpose the command using 
the distribution name has. To obtain the cdf of a distribution, precede the name by p, 
and then - is the value at which you want to evaluate the cdf. To obtain the inverse cdf 
of a distribution, precede the name by q, and then - is the value at which you want to 
evaluate the inverse cdf. To obtain the density or probability function, precede the name 
by d, and then - is the value at which you want to evaluate the density or probability 
function. To obtain random samples, precede the name by r, and then - is the size of 
the random sample you want to generate. 

For example, 


> x <- rnorm(4,1,2) 
> x
[1] -0.2462307 2.7992913 4.7541085 3.3169241 


generates a sample of 4 from the N(1, 4) distribution (the third argument of rnorm is the
standard deviation, here 2) and assigns this to the vector x.
The command 




> dnorm(3.2,2,.5) 
[1] 0.04478906 


evaluates the N (2, .25) pdf at 3.2, while 


> pnorm(3.2,2,.5) 
[1] 0.9918025 


evaluates the N (2, 0.25) cdf at 3.2, and 


> qnorm(.025,2,.5)
[1] 1.020018 


gives the 0.025 quantile of the N(2, 0.25) distribution. 
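The same prefixes work for the other distributions in the table. The sketch below applies d, p, and q to a Binomial(10, 0.3) distribution (an arbitrary choice) and shows that the quantile function inverts the cdf in the normal example above.

```r
n <- 10
p <- 0.3

pk <- dbinom(0:n, n, p)       # probability function at 0, 1, ..., n
total <- sum(pk)              # the probabilities sum to 1
cdf3 <- pbinom(3, n, p)       # P(X <= 3)
cdf3_by_sum <- sum(pk[1:4])   # the same probability from the probability function
med <- qbinom(0.5, n, p)      # smallest x with P(X <= x) >= 0.5

# For the N(2, 0.25) distribution, qnorm undoes pnorm
back <- qnorm(pnorm(3.2, 2, 0.5), 2, 0.5)

print(c(total, cdf3, cdf3_by_sum, med, back))
```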

If we have data stored in a vector x, then we can sample values from x, with or
without replacement, using the sample function. For example, sample(x,n,T)
will generate a sample of n from x with replacement, while sample(x,n,F) will
generate a sample of n from x without replacement (note n must be no greater than
length(x) in the latter case).

Sometimes it is convenient to be able to repeat a simulation so the same random
values are generated. For this, you can use the set.seed command. For example,
set.seed(12345) establishes the seed as 12345.
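The effect of the seed can be seen by generating the same sample twice. This is a minimal sketch; the sample sizes and the vector 1:10 are arbitrary.

```r
set.seed(12345)
x1 <- rnorm(4, 1, 2)

set.seed(12345)            # resetting the seed restarts the random number stream
x2 <- rnorm(4, 1, 2)

same <- identical(x1, x2)  # TRUE: both runs produce the same values
print(same)

# Sampling without replacement never repeats a value
s <- sample(1:10, 5, F)
print(s)
```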


Tabulating Data 


The table command is available for tabulating data. For example, table(x) returns
a table containing a list of the unique values found in x and their frequency of
occurrence in x. This table can be assigned to a variable via


y <- table(x)


for further analysis (see the section The Chi-Squared Test).

If x and y are vectors of the same length, then table(x,y) produces a
cross-tabulation, i.e., counts the number of times each possible value of (x, y) is obtained,
where x can be any of the values taken in x and y can be any of the values taken in y.
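A short sketch of both uses of table follows. The two character vectors are made up for illustration.

```r
x <- c("a", "b", "a", "c", "a", "b")
y <- c("u", "v", "u", "u", "v", "v")

tab <- table(x)            # frequency of each unique value of x
print(tab)

cross <- table(x, y)       # counts for each combination of (x, y) values
print(cross)

count_a <- as.vector(tab["a"])   # number of times "a" occurs in x
count_au <- cross["a", "u"]      # number of positions where x is "a" and y is "u"
```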


Plotting Data 


R has a number of commands available for plotting data. For example, suppose we 
have a sample of size n stored in the vector x. 

The command hist(x) will provide a frequency histogram of the data where the
cutpoints are chosen automatically by R. We can add optional arguments to hist. The
following are some of the arguments available.


breaks - A vector containing the cutpoints.
freq - A logical variable; when freq=T (the default), a frequency
histogram is obtained, and when freq=F, a density histogram is obtained.


For example, hist(x,breaks=c(-10,-5,-2,0,2,5,10),freq=F) will plot
a density histogram with cutpoints -10, -5, -2, 0, 2, 5, 10, where we have been careful
to ensure that min(x) > -10 and max(x) < 10.




If y is another vector of the same length as x, then we can produce a scatter plot of
y against x via the command plot(x,y). The command plot(x,y,type="l")
provides a scatter plot of y against x, but now the points are joined by lines. The
command plot(x) plots the values in x against their index. The plot(ecdf(x))
command plots the empirical cdf of the data in x.

A boxplot of the data in x is obtained via the boxplot(x) command. Side-by-side
boxplots of the data in x, y, z, etc., can be obtained via boxplot(x,y,z).

A normal probability plot of the values in x can be obtained using the command
qqnorm(x).

A barplot can be obtained using the barplot command. For example, 


> h <- c(1,2,3) 
> barplot(h)


produces a barplot with 3 bars of heights 1, 2, and 3. 

There are many other aspects to plotting in R that allow the user considerable con- 
trol over the look of plots. We refer the reader to the manual for more discussion of 
these. 


Statistical Inference 


R has a powerful approach to fitting and making inference about models. Models are 
specified by the symbol ~. We do not discuss this fully here but only indicate how to use 
this to handle the simple and multiple linear regression models (where the response and 
the predictors are all quantitative), the one- and two-factor models (where the response 
is quantitative but the predictors are categorical), and the logistic regression model 
(where the response is categorical but the predictors are quantitative). Suppose, then, 
that we have a vector y containing the response values. 


Basic Statistics 


The function mean(y) returns the mean of the values in y, var(y) returns the
sample variance of the values in y, and sd(y) gives the sample standard deviation.
The command median(y) returns the median of y, while quantile(y,p)
returns the sample quantiles as specified in the vector of probabilities p. For example,
quantile(y,c(.25,.5,.75)) returns the median and the first and third quartiles.
The function sort(y) returns a vector with the values in y sorted from smallest
to largest, and rank(y) gives the ranks of the values in y.


The t-Test
For the data in y, we can use the command
> t.test(y,mu=1,alternative="two.sided",conf.level=.95)


to carry out a t-test. This computes the P-value for testing $H_0 : \mu = 1$ and forms a
0.95-confidence interval for $\mu$.
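The object returned by t.test is a list whose components can be extracted by name. The sketch below uses simulated data (not data from the text) and compares the reported statistic with the usual formula $(\bar{y} - \mu_0)/(s/\sqrt{n})$.

```r
set.seed(1)
y <- rnorm(20, mean = 1, sd = 2)     # simulated data for illustration

res <- t.test(y, mu = 1, alternative = "two.sided", conf.level = 0.95)

pval <- res$p.value                  # P-value for H0: mu = 1
ci <- res$conf.int                   # 0.95-confidence interval for mu
tstat <- unname(res$statistic)       # observed t-statistic

# The t-statistic computed by hand
by_hand <- (mean(y) - 1) / (sd(y) / sqrt(length(y)))

print(c(tstat, by_hand))
print(pval)
print(ci)
```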


The Chi-Squared Test 


Suppose y contains a vector of counts for k cells and prob contains hypothesized 
probabilities for these cells. Then the command 




> chisq.test(y,p=prob)


carries out the chi-squared test to assess this hypothesis. Note that y could also corre- 
spond to a one-dimensional table. 

If x and y are two vectors of the same length, then chisq.test(x,y) carries
out a chi-squared test for independence on the table formed by cross-tabulating the
entries in x and y. If we first create this cross-tabulation in the table t using the
table function, then chisq.test(t) carries out this test.
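For the goodness-of-fit case, the statistic reported by chisq.test can be reproduced by hand as the sum of (observed - expected)^2 / expected over the cells. The counts and probabilities below are made up for illustration.

```r
y <- c(18, 22, 25, 35)                 # observed counts in k = 4 cells
prob <- c(0.25, 0.25, 0.25, 0.25)      # hypothesized cell probabilities

res <- chisq.test(y, p = prob)

# Chi-squared statistic computed by hand
expected <- sum(y) * prob
stat <- sum((y - expected)^2 / expected)

print(c(unname(res$statistic), stat))
print(res$parameter)                   # degrees of freedom, k - 1
print(res$p.value)
```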


Simple Linear Regression 


Suppose we have a single predictor with values in the vector x. The simple linear
regression model $E(y \mid x) = \beta_1 + \beta_2 x$ is then specified in R by y~x. We refer to y~x
as a model formula, and read this as “y is modelled as a linear model involving x.” To
carry out the fitting (which we have done here for a specific set of data), we use the
fitting linear models command lm, as follows. The command


> regexamp <- lm(y~x) 


carries out the computations for fitting and inference about this model and assigns the 
result to a structure called regexamp. Any other valid name could have been used for 
this structure. We can now use various R functions to pick off various items of interest. 
For example, 


> summary(regexamp)

Call:
lm(formula = y~x)

Residuals:
    Min      1Q  Median      3Q     Max
-4.2211 -2.1163  0.3248  1.7255  4.3323

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.5228     1.2076   5.401 4.31e-05 ***
x             1.7531     0.1016  17.248 1.22e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.621 on 18 degrees of freedom
Multiple R-squared: 0.9429, Adjusted R-squared: 0.9398
F-statistic: 297.5 on 1 and 18 DF, p-value: 1.219e-12


uses the summary function to give us all the information we need. For example, the
fitted line is given by 6.5228 + 1.7531x. The test of $H_0 : \beta_2 = 0$ has a P-value of
$1.22 \times 10^{-12}$, so we have strong evidence against $H_0$. Furthermore, the $R^2$ is given
by 94.29%. Individual items can be accessed via various R functions and we refer the
reader to ?lm for this.
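As a sketch of how individual items can be picked off a fitted model, the following uses the standard extractor coef and the r.squared component of the summary. The data are simulated here and are not the data set used in the text.

```r
set.seed(2)
x <- 1:20
y <- 6 + 1.8 * x + rnorm(20, sd = 2.5)   # simulated data for illustration

fit <- lm(y ~ x)
b <- coef(fit)                           # named vector: (Intercept) and x
r2 <- summary(fit)$r.squared             # the R^2 value

# Least-squares estimates computed directly from the usual formulas
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

print(b)
print(c(intercept, slope))
print(r2)
```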




Multiple Linear Regression 


If we have two quantitative predictors in the vectors x1 and x2, then we can proceed
just as with simple linear regression to fit the linear regression model $E(y \mid x_1, x_2) =
\beta_1 + \beta_2 x_1 + \beta_3 x_2$. For example, the commands


> regex <- lm(y~x1+x2)
> summary(regex)


fit the above linear model, assign the results of this to the structure regex, and then
the summary function prints out (suppressed here) all the relevant quantities. We read
y~x1+x2 as, “y is modelled as a linear model involving x1 and x2.” In particular,
the F-statistic, and its associated P-value, is obtained for testing $H_0 : \beta_2 = \beta_3 = 0$.

This generalizes immediately to linear regression models with $k$ quantitative predictors
$x_1, \ldots, x_k$. Furthermore, suppose we want to test that the model only involves
$x_1, \ldots, x_l$ for $l < k$. We use lm to fit the model for all $k$ predictors, assign this to
regex, and also use lm to fit the model that only involves $l$ predictors and assign
this to regex1. Then the command anova(regex,regex1) will output the
F-statistic, and its P-value, for testing $H_0 : \beta_{l+2} = \cdots = \beta_{k+1} = 0$ (i.e., that the
coefficients of $x_{l+1}, \ldots, x_k$ are all 0).


One- and Two-Factor ANOVA 


Suppose now that A denotes a categorical predictor taking two levels $a_1$ and $a_2$. Note
that the values of A may be character in value rather than numeric, e.g., x is a character
vector containing the values a1 and a2, used to denote at which level the corresponding
value of y was observed. In either case, we need to make this into a factor A, via
the command


> A <- factor(x)
so that A can be used in the analysis. Then the command
> aov(y~A)


produces the one-way ANOVA table. Of course, aov also handles factors with more
than two levels. To produce the cell means, use the command tapply(y,A,mean).

Suppose there is a second factor B taking 5 levels $b_1, \ldots, b_5$. If this is the factor B
in R, then the command


> aov(y~A+B+A:B)


produces the two-way ANOVA for testing for interactions between factors A and B. To
produce the cell means, use the command tapply(y,list(A,B),mean). The
command aov(y~A+B) produces the ANOVA table, assuming that there are no
interactions.
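A one-factor analysis can be sketched as follows with simulated data (not data from the text). In current versions of R, applying summary to the fitted aov object is what prints the table with the F-statistic and P-value, so that call is included here.

```r
set.seed(3)
x <- rep(c("a1", "a2", "a3"), each = 5)            # level labels, 5 observations each
y <- rnorm(15) + rep(c(0, 1, 2), each = 5)         # simulated responses

A <- factor(x)                 # convert the character vector to a factor
fit <- aov(y ~ A)
print(summary(fit))            # ANOVA table with F-statistic and P-value

cell_means <- tapply(y, A, mean)   # mean of y at each level of A
print(cell_means)
```

The same pattern extends to two factors by replacing the formula with y ~ A + B + A:B.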


Logistic Regression 


Suppose we have binary data stored in the vector y, and x contains the corresponding
values of a quantitative predictor. Then we can use the generalized linear model command
glm to fit the logistic regression model $P(Y = 1 \mid x) = \exp\{\beta_1 + \beta_2 x\}/(1 +
\exp\{\beta_1 + \beta_2 x\})$. The commands


> logreg <- glm(y~x, family=binomial)
> summary(logreg)


fit the logistic regression model, assign the results to logreg, and then the summary
command outputs this material. This gives us the estimates of the $\beta_i$, their standard
errors, and P-values for testing that the $\beta_i = 0$.


Control Statements and R Programs 


A basic control statement is of the form if (expr1) expr2 else expr3, where
expr1 takes a logical value, expr2 is executed if expr1 is T, and expr3 is executed
if expr1 is F. For example, if x is a variable taking value -2, then


> if (x<0) {y <- -1} else {y <- 1}


results in y being assigned the value -1. Note that the else part of the statement can
be dropped.

The command for (name in expr1) expr2 executes expr2 for each value
of name in expr1. For example,


> for (i in 1:10) print(i)


prints the value of the variable i as i is sequentially assigned values in {1, 2, ..., 10}.
Note that m:n is a shorthand for the sequence (m, m + 1, ..., n) in R. As another
example,


> for (i in 1:20) y[i] <- 2^i


creates a vector y with 20 entries, where the ith element of y equals $2^i$.

The break terminates a loop, perhaps based on some condition holding, while 
next halts the processing of the current iteration and advances the looping index. 
Both break and next apply only to the innermost of nested loops. 
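The effect of next and break can be seen in a small loop. This is an illustrative sketch: it adds up the odd numbers from 1 to 10, but stops as soon as an odd number larger than 7 is reached.

```r
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next    # skip even values of i and move to the next iteration
  if (i > 7) break         # leave the loop entirely at the first odd i above 7
  total <- total + i       # reached only for i = 1, 3, 5, 7
}
print(total)
```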

Commands in R can be grouped by placing them within braces {expr1; expr2; 
...}. The commands within the braces are executed as a unit. For example, 


> for (i in 1:20) {print(i); y[i] <- 2^i; print(y[i])}


causes i to be printed, y [i] to be assigned, and y [i] to be printed, all within a for 
loop. 

Often when a computation is complicated, such as one that involves looping, it 
is better to put all the R commands in a single file and then execute the file in batch 
mode. For example, suppose you have a file prog.R containing R code. Then the 
command source("pathname/prog.R") causes all the commands in the file to
be executed. 

It is often convenient to put comments in R programs to explain what the lines of 
code are doing. A comment line is preceded by # and of course it is not executed. 


User-Defined Functions 


R also allows user-defined functions. The syntax of a function definition is as follows. 




function_name <- function(arguments) {
  function body;
  return(return value);
}


For example, the following function computes the sample coefficient of variation of the
data x.


coef_var <- function(x) {
  result <- sd(x)/mean(x);
  return(result);
}


Then if we want to subsequently compute the coefficient of variation of data y, we
simply type coef_var(y).


Arrays and Lists 


A vector of length m can also be thought of as a one-dimensional array of length m. 
R can handle multidimensional arrays, e.g., m x n, m x n x p arrays, etc. If a is a
three-dimensional array, then a[i,j,k] refers to the entry in the (i, j, k)-th position
of the array. There are various operations that can be carried out on arrays and we refer 
the reader to the manual for these. Later in this manual, we will discuss the special 
case of two-dimensional arrays, which are also known as matrices. For now, we just 
think of arrays as objects in which we store data. 

A very general data structure in R is given by a list. A list is similar to an array, 
with several important differences. 


1. Any entry in an array is referred to by its index. But any entry in a list may
be referred to by a character name. For example, the fitted regression coefficients
are referred to by regex$coefficients after fitting the linear model
regex <- lm(y ~ x1 + x2). The dollar mark ($) is the entry reference
operator, that is, varname$entname indicates the “entname” entry in the list
“varname.”


2. While an array stores only the same type of data, a list can store any R objects.
For example, the coefficients entry in a linear regression object is a numeric
vector, and the model entry is a list.


3. The reference operators are different: arr[i] refers to the ith entry in the array
arr, and lst[[i]] refers to the ith entry in the list lst. Note that i can be
the entry name, i.e., lst$entname and lst[['entname']] refer to the
same data.
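The differences listed above can be seen in a short sketch. The list lst and its entry names are made up for illustration.

```r
# A list can hold entries of different types, each with a name
lst <- list(label = "fit1", coefs = c(1.5, -2), ok = TRUE)

a1 <- lst$coefs          # reference by name with the $ operator
a2 <- lst[["coefs"]]     # the same entry by name with [[ ]]
a3 <- lst[[2]]           # the same entry by position

# An array (here a vector) holds a single type and is indexed with [ ]
arr <- c(10, 20, 30)
second <- arr[2]

print(a1)
print(names(lst))
print(second)
```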


Examples 


We now consider some examples relevant to particular sections or examples in the main 
text. To run any of these codes, you first have to define the functions. To do this, load 




the code using the source command. Arguments to the functions then need to be 
specified. Note that lines in the listings may be broken unnaturally and continue on the 
following line. 


EXAMPLE B.1.1 Bootstrapping in Example 6.4.2 

The following R code generates bootstrap samples and calculates the median of each
of these samples. To run this code, type y <- bootstrap_median(m,x), where
m is the number of bootstrap samples, x contains the original sample, and the medians
of the resamples are stored in y. The statistic to be bootstrapped can be changed by
substituting for median in the code.


# Example B.1.1
# function name: bootstrap_median
# parameters:
#   m   number of resamples
#   x   original data
# return value:
#   a vector of resampled medians
# description: resamples and stores its median
bootstrap_median <- function(m,x) {
  n <- length(x);
  result <- rep(0,m);
  for(i in 1:m) result[i] <- median(sample(x,n,T));
  return(result);
}
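
The remark that the statistic can be changed by substituting for median can be sketched by passing the statistic as an argument. The name bootstrap_stat, its stat argument, and the sample x below are hypothetical and are not part of the text's code.

```r
# Sketch of a generalized version; bootstrap_stat and stat are
# hypothetical names introduced here for illustration.
bootstrap_stat <- function(m, x, stat = median) {
  n <- length(x)
  result <- numeric(m)
  # each resample draws n values from x with replacement
  for (i in 1:m) result[i] <- stat(sample(x, n, replace = TRUE))
  result
}

set.seed(34256734)                          # fix the seed so a run can be repeated
x <- c(2.3, 4.1, 1.7, 5.6, 3.3, 4.8, 2.9)   # made-up sample
y_median <- bootstrap_stat(1000, x)         # resampled medians
y_mean   <- bootstrap_stat(1000, x, mean)   # resampled means
boot_var_median <- var(y_median)            # bootstrap variance of the median
print(boot_var_median)
```

Passing the function itself avoids editing the loop each time a different statistic is wanted.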


EXAMPLE B.1.2 Sampling from the Posterior in Example 7.3.1 
The following R code generates a sample from the joint posterior in Example 7.3.1.
To run a simulation, type


post <- post_normal(m,x,alpha0,beta0,mu0,tau0square)


where m is the Monte Carlo sample size, x contains the data, and the remaining
arguments are the hyperparameters of the prior. The result is a list called (in this
case) post, where post$mu and post$sigmasq contain the generated values of
$\mu$ and $\sigma^2$, respectively. For example,


> x <- c(11.6714, 1.8957, 2.1228, 2.1286, 1.0751, 8.1631,
+ 1.8236, 4.0362, 6.8513, 7.6461, 1.9020, 7.4899, 4.9233,
+ 8.3223, 7.9486);
> post <- post_normal(10**4,x,2,1,4,2)
> z <- sqrt(post$sigmasq)/post$mu


runs a simulation as in Example 7.3.1, with $N = 10^4$.


# Example B.1.2
# function name: post_normal
# parameters:
#   m           sample size
#   x           data
#   alpha0      shape parameter for 1/sigma^2
#   beta0       rate parameter for 1/sigma^2
#   mu0         location parameter for mu
#   tau0square  variance ratio parameter for mu
# returned values:
#   mu          sampled mu
#   sigmasq     sampled sigmasquare
# description: samples from the posterior distribution
#   in Example 7.3.1
#
post_normal <- function(m,x,alpha0,beta0,mu0,tau0square) {
  # set the length of the data
  n <- length(x);
  # the shape and rate parameters of the posterior dist.
  # alpha_x = first parameter of the gamma dist.
  #         = (alpha0 + n/2)
  alpha_x <- alpha0 + n/2
  # beta_x = the rate parameter of the gamma dist.
  beta_x <- beta0 + (n-1)/2 * var(x) + n*(mean(x)-mu0)**2/
    2/(1+n*tau0square);
  # mu_x = the mean parameter of the normal dist.
  mu_x <- (mu0/tau0square+n*mean(x))/(n+1/tau0square);
  # tausq_x = the variance ratio parameter of the normal
  #   distribution
  tausq_x <- 1/(n+1/tau0square);
  # initialize the result
  result <- list();
  result$sigmasq <- 1/rgamma(m,alpha_x,rate=beta_x);
  result$mu <- rnorm(m,mu_x,sqrt(tausq_x *
    result$sigmasq));
  return(result);
}
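
As a check on the formulas inside post_normal, the posterior parameters can be computed directly for the data and hyperparameters used in this appendix. This is a sketch rather than part of the text's code; up to rounding, the results should agree with the constants 9.5, 77.578, 5.161, and 15.5 that appear in the Minitab macro of Example B.2.2.

```r
# Data and prior hyperparameters as used in Examples B.1.2 and B.1.3.
x <- c(11.6714, 1.8957, 2.1228, 2.1286, 1.0751, 8.1631,
       1.8236, 4.0362, 6.8513, 7.6461, 1.9020, 7.4899, 4.9233,
       8.3223, 7.9486)
alpha0 <- 2; beta0 <- 1; mu0 <- 4; tau0square <- 2

n <- length(x)
# shape and rate of the gamma posterior for 1/sigma^2
alpha_x <- alpha0 + n/2
beta_x  <- beta0 + (n - 1)/2 * var(x) +
  n * (mean(x) - mu0)^2 / (2 * (1 + n * tau0square))
# mean and variance ratio of the normal posterior for mu given sigma^2
mu_x    <- (mu0/tau0square + n * mean(x)) / (n + 1/tau0square)
tausq_x <- 1 / (n + 1/tau0square)

print(round(c(alpha_x, beta_x, mu_x, 1/tausq_x), 3))
```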


EXAMPLE B.1.3 Calculating the Estimates and Standard Errors in Example 7.3.1 
Once we have a sample of values from the posterior distribution of $\psi$ stored in psi,
we can estimate the posterior probability that $\psi \le 0.5$ and calculate the interval
given by this estimate plus or minus 3 standard errors as a measure of the accuracy
of the estimation.


# Example B.1.3
# set the data
x <- c(11.6714, 1.8957, 2.1228, 2.1286, 1.0751, 8.1631,
  1.8236, 4.0362, 6.8513, 7.6461, 1.9020, 7.4899,
  4.9233, 8.3223, 7.9486);
post <- post_normal(10**4,x,2,1,4,2);
# compute the coefficient of variation
psi <- sqrt(post$sigmasq)/post$mu;
psi_hat <- mean(psi <= .5);
psi_se <- sqrt(psi_hat * (1-psi_hat))/sqrt(length(psi));
# the interval
cq <- 3
cat("The three times s.e. interval is ",
  "[",psi_hat-cq*psi_se, ", ", psi_hat+cq*psi_se,"]\n");


EXAMPLE B.1.4 Using the Gibbs Sampler in Example 7.3.2 
To run this function, type 


post<-gibbs_normal(m,x,alpha0,beta0,lambda,mu0,
tau0sq,burnin=0)


as this creates a list called post, where post$mu and post$sigmasq contain the
generated values of $\mu$ and $\sigma^2$, respectively. Note that the burnin argument is set to
a nonnegative integer and indicates that we wish to discard the first burnin values of
$\mu$ and $\sigma^2$ and retain the last m. The default value is burnin=0.


# Example B.1.4
# function name: gibbs_normal
# parameters
#   m       the size of posterior sample
#   x       data
#   alpha0  shape parameter for 1/sigma^2
#   beta0   rate parameter for 1/sigma^2
#   lambda  degrees of freedom of Student's t-dist.
#   mu0     location parameter for mu
#   tau0sq  scale parameter for mu
#   burnin  size of burn in. the default value is 0.
# returned values
#   mu       sampled mu's
#   sigmasq  sampled sigma^2's
# description: samples from the posterior in Ex. 7.3.2
#
gibbs_normal <- function(m,x,alpha0,beta0,lambda,mu0,
    tau0sq,burnin=0) {
  # initialize the result
  result <- list();
  result$sigmasq <- result$mu <- rep(0,m);
  # set the initial parameter
  mu <- mean(x);
  sigmasq <- var(x);
  n <- length(x);
  # set parameters
  alpha_x <- n/2 + alpha0 + 1/2;
  # loop
  for(i in (1-burnin):m) {
    # update v_i's
    v <- rgamma(n,(lambda+1)/2,rate=((x-mu)**2/
      sigmasq/lambda+1)/2);
    # update sigma-square
    beta_x <- (sum(v*(x-mu)**2)/lambda+(mu-mu0)**2/
      tau0sq)/2+beta0;
    sigmasq <- 1/rgamma(1,alpha_x,rate=beta_x);
    # update mu
    r <- 1/(sum(v)/lambda+1/tau0sq);
    mu <- rnorm(1,r*(sum(v*x)/lambda+mu0/tau0sq),
      sqrt(r*sigmasq));
    # burnin check
    if(i < 1) next;
    result$mu[i] <- mu;
    result$sigmasq[i] <- sigmasq;
  }
  result$psi <- sqrt(result$sigmasq)/result$mu;
  return(result);
}

EXAMPLE B.1.5 Batching in Example 7.3.2 

The following R code divides a series of data into batches and calculates the batch
means. To run the code, type y<-batching(k, x) to place in the vector y the means
of consecutive batches of size k taken from the data in the vector x.


# Example B.1.5
# function name: batching
# parameters:
#   k   size of each batch
#   x   data
# return value:
#   an array of the averages of each batch
# description: this function separates the data x into
#   floor(length(x)/k) batches and returns the array of
#   the averages of each batch
batching <- function(k,x) {
  m <- floor(length(x)/k);
  result <- rep(0,m);
  for(i in 1:m) result[i] <- mean(x[(i-1)*k+(1:k)]);
  return(result);
}



EXAMPLE B.1.6 Simulating a Sample from the Distribution of the Discrepancy Sta- 
tistic in Example 9.1.2 

The following R code generates a sample from the discrepancy statistic specified in
Example 9.1.2. To generate the sample, type y<-discrepancy(m,n) to place a
sample of size m in y, where n is the size of the original data set. This code can be
easily modified to generate samples from other discrepancy statistics.


# Example B.1.6
# function name: discrepancy
# parameters:
#   m   resample size
#   n   size of data
# return value:
#   an array of m discrepancies
# description: this function generates m discrepancies
#   when the data size is n
discrepancy <- function(m,n) {
  result <- rep(0,m);
  for(i in 1:m) {
    x <- rnorm(n);
    xbar <- mean(x);
    r <- (x-xbar)/sqrt((sum((x-xbar)**2)));
    result[i] <- -sum(log(r**2));
  }
  return(result/n);
}


EXAMPLE B.1.7 Generating from a Dirichlet Distribution in Example 10.2.3 

The following R code generates a sample from a Dirichlet$(\alpha_1, \alpha_2, \alpha_3, \alpha_4)$ distribution.
To generate from this distribution, first assign values to the vector alpha and then
type ddirichlet(n, alpha), where n is the sample size.


# Example B.1.7
# function name: ddirichlet
# parameters:
#   n      sample size
#   alpha  vector(alpha1,...,alphak)
# return value:
#   a (n x k) matrix. rows are i.i.d. samples
# description: this function generates n random samples
#   from Dirichlet(alpha1,...,alphak) distribution
ddirichlet <- function(n,alpha) {
  k <- length(alpha);
  result <- matrix(0,n,k);
  for(i in 1:k) result[,i] <- rgamma(n,alpha[i]);
  for(i in 1:n) result[i,] <- result[i,] /
    sum(result[i,]);
  return(result);
}


Matrices 


A matrix can be thought of as a collection of data values with two subscripts or as a
rectangular array of data. So if a is a matrix, then a[i,j] is the (i, j)-th element in
a. Note that a[i,] refers to the ith row of a and a[,j] refers to the jth column of
a. If a matrix has m rows and n columns, then it is an m x n matrix, and m and n are
referred to as the dimensions of the matrix.

Perhaps the simplest way to create matrices is with cbind and rbind commands. 
For example, 


> x<-c(1,2,3)
> y<-c(4,5,6)
> a<-cbind(x,y)
> a
     x y
[1,] 1 4
[2,] 2 5
[3,] 3 6

creates the vectors x and y, and the cbind command takes x as the first column
and y as the second column of the newly created 3 x 2 matrix a. Note that in the
printout of a, the columns are still labelled x and y, although we can still refer to
these as a[,1] and a[,2]. We can remove these column names via the command
colnames(a)<-NULL. Similarly, the rbind command will treat vector arguments
as the rows of a matrix. To determine the number of rows and columns of a matrix a,
we can use the nrow(a) and ncol(a) commands. We can also create a diagonal
matrix using the diag command. If x is an n-dimensional vector, then diag(x) is
an n x n matrix with the entries in x along the diagonal and 0's elsewhere. If a is an
m x n matrix, then diag(a) is the vector with entries taken from the main diagonal
of a. To create an n x n identity matrix, use diag(n).

There are a number of operations that can be carried out on matrices. If matrices
a and b are m x n, then a+b is the m x n matrix formed by adding the matrices
componentwise. The transpose of a is the n x m matrix t(a), with ith row equal
to the ith column of a. If c is a number, then c*a is the m x n matrix formed by
multiplying each element of a by c. If a is m x n and b is n x p, then a%*%b is the
m x p matrix product (Appendix A.4) of a and b. A numeric vector is treated as a
column vector in matrix multiplication. Note that a*b is also defined when a and b
are of the same dimension, but this is the componentwise product of the two matrices,
which is quite different from the matrix product.

If a is an m x m matrix, then the inverse of a is obtained as solve(a). The
solve command will return an error if the matrix does not have an inverse. If a is a
square matrix, then det(a) computes the determinant of a.
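
The operations just described can be tried on a small matrix. The 2 x 2 matrix below is made up for illustration; the point is mainly to show that %*% and * give different results.

```r
# A made-up 2 x 2 matrix with columns (2,1) and (1,3).
a <- cbind(c(2, 1), c(1, 3))
b <- diag(2)                 # 2 x 2 identity matrix

sum_ab   <- a + b            # componentwise sum
scaled   <- 3 * a            # multiply every entry by a number
prod_mat <- a %*% a          # matrix product
prod_cw  <- a * a            # componentwise product, not the matrix product
a_t      <- t(a)             # transpose
a_inv    <- solve(a)         # inverse
det_a    <- det(a)           # determinant

print(prod_mat)
print(prod_cw)
print(a_inv %*% a)           # numerically the identity matrix
```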

We now consider an important application. 




EXAMPLE B.1.8 Fitting Regression Models 

Suppose the n-dimensional vector y corresponds to the response vector and the n x k
matrix V corresponds to the design matrix when we are fitting a linear regression model
given by $E(y \mid V) = V\beta$. The least-squares estimate of $\beta$ is given by b as computed in


> b<-solve(t(V)%*%V)%*%t(V)%*%y

with the vector of predicted values p and residuals r given by


> p<-V%*%b
> r<-y-p

with squared lengths

> slp<-t(p)%*%p
> slr<-t(r)%*%r


where slp is the squared length of p and slr is the squared length of r. Note that
the matrix solve(t(V)%*%V) is used for forming confidence intervals and tests for
the individual $\beta_i$. Virtually all the computations involved in fitting and inference for
the linear regression model can be carried out using matrix computations in R like the
ones we have illustrated.
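
The matrix computations of this example can be compared with the built-in lm function. The sketch below uses simulated data; the sample size, the coefficients 1, 2, -1, and the names x1, x2, s2, se are assumptions made only for illustration.

```r
# Simulated data, for illustration only.
set.seed(1)
n  <- 20
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

V <- cbind(1, x1, x2)                      # design matrix with intercept column
b <- solve(t(V) %*% V) %*% t(V) %*% y      # least-squares estimate
p <- V %*% b                               # predicted values
r <- y - p                                 # residuals
slp <- t(p) %*% p                          # squared length of p
slr <- t(r) %*% r                          # squared length of r

# Standard errors of the b_i from s^2 times the diagonal of (V'V)^{-1}.
s2 <- as.numeric(slr) / (n - ncol(V))
se <- sqrt(s2 * diag(solve(t(V) %*% V)))

# The same fit with lm, for comparison.
fit <- lm(y ~ x1 + x2)
max_coef_diff <- max(abs(as.vector(b) - unname(coef(fit))))
print(max_coef_diff)
```

Since p and r are orthogonal, slp + slr equals the squared length of y, which gives a quick numerical check on the fit.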


Packages 


There are many packages that have been written to extend the capability of basic R. It is
very likely that if you have a data analysis need that cannot be met with basic R, then you
can find a freely available package to add. We refer the reader to ?install.packages
and ?library for more on this.


B.2 | Using Minitab 


All the computations found in this text were carried out using Minitab. This statistical 
software package is very easy to learn and use. Other packages such as SAS or R (see 
Section B.1) could also be used for this purpose. 

Most of the computations were performed using Minitab like a calculator, i.e., data 
were entered and then a number of Minitab commands were accessed to obtain the 
quantities desired. No programming is required for these computations. 

There were a few computations, however, that did involve a bit of programming. 
Typically, this was a computation in which numerous operations had to be performed 
many times, and so looping was desirable. In each such case, we have recorded here the 
Minitab code that we used for these computations. As the following examples show, 
these programs were never very involved. 

Students can use these programs as templates for writing their own Minitab pro-
grams. Actually, the language is so simple that we feel that anyone using another
language for programming can read these programs and use them as templates in the
same way. Simply think of the symbols c1, c2, etc. as arrays where we address the
ith element in the array c1 by c1(i). Furthermore, there are constants k1, k2, etc.

A Minitab program is called a macro and must start with the statement gmacro
and end with the statement endmacro. The first statement after gmacro gives a name
to the program. Comments in a program, put there for explanatory purposes, start with
note.

If the file containing the program is called prog.txt and this is stored in the root
directory of a disk drive called c, then the Minitab command


MTB> %c:/prog.txt 


will run the program. Any output will either be printed in the Session window (if you 
have used a print command) or stored in the Minitab worksheet. 

More details on Minitab can be found by using Help in the program. We provide 
some examples of Minitab macros used in the text. 


EXAMPLE B.2.1 Bootstrapping in Example 6.4.2 

The following Minitab code generates 1000 bootstrap samples from the data in c1, 
calculates the median of each of these samples, and then calculates the sample variance 
of these medians. 


gmacro
bootstrapping
base 34256734
note - original sample is stored in c1
note - bootstrap sample is placed in c2 with each one
note overwritten
note - medians of bootstrap samples are stored in c3
note - k1 = size of data set (and bootstrap samples)
let k1=15
do k2=1:1000
sample 15 c1 c2;
replace.
let c3(k2)=median(c2)
enddo
note - k3 equals (6.4.5)
let k3=(stdev(c3))**2
print k3
endmacro


EXAMPLE B.2.2 Sampling from the Posterior in Example 7.3.1 

The following Minitab code generates a sample of $10^4$ from the joint posterior in Ex-
ample 7.3.1. Note that in Minitab software, the Gamma$(\alpha, \beta)$ density takes the form
$(\beta^{-\alpha}/\Gamma(\alpha))\, x^{\alpha-1} e^{-x/\beta}$. So to generate from a Gamma$(\alpha, \beta)$ distribution, as defined
in this book, we must put the second parameter equal to $1/\beta$ in Minitab.
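
The same distinction between the two parameterizations exists in R, where dgamma and rgamma accept either a rate or a scale argument. The following sketch uses the posterior values 9.5 and 77.578 from this example; the evaluation points x are arbitrary.

```r
alpha <- 9.5       # shape
beta  <- 77.578    # rate, as the Gamma distribution is defined in this book
x <- c(0.05, 0.1, 0.2)

# Book's parameterization: rate = beta.
d_rate  <- dgamma(x, shape = alpha, rate = beta)
# Scale parameterization: the second parameter is 1/beta.
d_scale <- dgamma(x, shape = alpha, scale = 1/beta)
# Density written out from the book's definition.
d_formula <- beta^alpha / gamma(alpha) * x^(alpha - 1) * exp(-beta * x)

print(cbind(d_rate, d_scale, d_formula))

# Generated values of sigma^2 are reciprocals of gamma variates.
set.seed(34256734)
sigmasq <- 1 / rgamma(5, shape = alpha, rate = beta)
```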


gmacro
normalpost
note - the base command sets the seed for the random
note numbers
base 34256734
note - the parameters of the posterior
note - k1 = first parameter of the gamma distribution
note = (alpha_0 + n/2)
let k1=9.5
note - k2 = 1/beta
let k2=1/77.578
note - k3 = posterior mean of mu
let k3=5.161
note - k4 = (n + 1/(tau_0 squared))**(-1)
let k4=1/15.5
note - main loop
note - c3 contains generated value of sigma**2
note - c4 contains generated value of mu
note - c5 contains generated value of coefficient of
note variation
do k5=1:10000
random 1 c1;
gamma k1 k2.
let c3(k5)=1/c1(1)
let k6=sqrt(k4/c1(1))
random 1 c2;
normal k3 k6.
let c4(k5)=c2(1)
let c5(k5)=sqrt(c3(k5))/c4(k5)
enddo
endmacro


EXAMPLE B.2.3 Calculating the Estimates and Standard Errors in Example 7.3.1 
We have a sample of $10^4$ values from the posterior distribution of $\psi$ stored in C5.
The following computations use this sample to calculate an estimate of the posterior
probability that $\psi \le 0.5$ (k1), as well as to calculate the standard error of this estimate
(k2), the estimate minus three times its standard error (k3), and the estimate plus three
times its standard error (k4).


let c6=c5 le .5
let k1=mean(c6)
let k2=sqrt(k1*(1-k1))/sqrt(10000)
let k3=k1-3*k2
let k4=k1+3*k2
print k1 k2 k3 k4




EXAMPLE B.2.4 Using the Gibbs Sampler in Example 7.3.2 
The following Minitab code generates a chain of $10^4$ values using the Gibbs
sampler described in Example 7.3.2.


gmacro
gibbs
base 34256734
note - data sample is stored in c1
note - starting value for mu.
let k1=mean(c1)
note - starting value for sigma**2
let k2=stdev(c1)
let k2=k2**2
note - lambda
let k3=3
note - sample size
let k4=15
note - n/2 + alpha_0 + 1/2
let k5=k4/2+2+.5
note - mu_0
let k6=4
note - tau_0**2
let k7=2
note - beta_0
let k8=1
let k9=(k3/2+.5)
note - main loop
do k100=1:10000
note - generate the nu_i in c10
do k111=1:15
let k10=.5*(((c1(k111)-k1)**2)/(k2*k3)+1)
let k10=1/k10
random 1 c2;
gamma k9 k10.
let c10(k111)=c2(1)
enddo
note - generate sigma**2 in c20
let c11=c10*((c1-k1)**2)
let k11=.5*sum(c11)/k3+.5*((k1-k6)**2)/k7+k8
let k11=1/k11
random 1 c2;
gamma k5 k11.
let c20(k100)=1/c2(1)
let k2=1/c2(1)
note - generate mu in c21
let k13=1/(sum(c10)/k3+1/k7)
let c11=c1*c10/k3
let k14=sum(c11)+k6/k7
let k14=k13*k14
let k13=sqrt(k13*k2)
random 1 c2;
normal k14 k13.
let c21(k100)=c2(1)
let k1=c2(1)
enddo
endmacro


EXAMPLE B.2.5 Batching in Example 7.3.2 
The following Minitab code divides the generated sample, obtained via the Gibbs sam- 
pling code for Example 7.3.2, into batches, and calculates the batch means. 


gmacro
batching
note - k2= batch size
let k2=40
note - k4 holds the batch sums
note - c1 contains the data to be batched (10000 data values)
note - c2 will contain the batch means (250 batch means)
do k10=1:10000/40
let k4=0
do k20=0:39
let k3=c1(k10+k20)
let k4=k4+k3
enddo
let k11=floor(k10/k2)+1
let c2(k11)=k4/k2
enddo
endmacro


EXAMPLE B.2.6 Simulating a Sample from the Distribution of the Discrepancy Sta- 
tistic in Example 9.1.2 

The following code generates a sample from the discrepancy statistic specified in Ex- 
ample 9.1.2. 


gmacro
goodnessoffit
base 34256734
note - generated sample is stored in c1
note - residuals are placed in c2
note - value of D(r) are placed in c3
note - k1 = size of data set
let k1=5
do k2=1:10000
random k1 c1
let k3=mean(c1)
let k4=sqrt(k1-1)*stdev(c1)
let c2=((c1-k3)/k4)**2
let c2=loge(c2)
let k5=-sum(c2)/k1
let c3(k2)=k5
enddo
endmacro


EXAMPLE B.2.7 Generating from a Dirichlet Distribution in Example 10.2.3 
The following code generates a sample from a Dirichlet$(\alpha_1, \alpha_2, \alpha_3, \alpha_4)$ distribution,
where $\alpha_1 = 2$, $\alpha_2 = 3$, $\alpha_3 = 1$, $\alpha_4 = 1.5$.


gmacro
dirichlet
note - the base command sets the seed for the random
note number generator (so you can repeat a simulation).
base 34256734
note - here we provide the algorithm for generating from
note a Dirichlet(k1,k2,k3,k4) distribution.
note - assign the values of the parameters.
let k1=2
let k2=3
let k3=1
let k4=1.5
let k5=k2+k3+k4
let k6=k3+k4
note - generate the sample with i-th sample in i-th row
note of c2, c3, c4, c5.
do k10=1:5
random 1 c1;
beta k1 k5.
let c2(k10)=c1(1)
random 1 c1;
beta k2 k6.
let c3(k10)=(1-c2(k10))*c1(1)
random 1 c1;
beta k3 k4.
let c4(k10)=(1-c2(k10)-c3(k10))*c1(1)
let c5(k10)=1-c2(k10)-c3(k10)-c4(k10)
enddo
endmacro

Appendix C 
Common Distributions 


We record here the most commonly used distributions in probability and statistics as 
well as some of their basic characteristics. 


C.1 | Discrete Distributions 
1. Bernoulli$(\theta)$, $\theta \in [0, 1]$ (same as Binomial$(1, \theta)$).

probability function: $p(x) = \theta^x (1 - \theta)^{1-x}$ for $x = 0, 1$.
mean: $\theta$.
variance: $\theta(1 - \theta)$.
moment-generating function: $m(t) = 1 - \theta + \theta e^t$ for $t \in R^1$.

2. Binomial$(n, \theta)$, $n > 0$ an integer, $\theta \in [0, 1]$.

probability function: $p(x) = \binom{n}{x} \theta^x (1 - \theta)^{n-x}$ for $x = 0, 1, \ldots, n$.
mean: $n\theta$.
variance: $n\theta(1 - \theta)$.
moment-generating function: $m(t) = (1 - \theta + \theta e^t)^n$ for $t \in R^1$.

3. Geometric$(\theta)$, $\theta \in (0, 1]$ (same as Negative-Binomial$(1, \theta)$).

probability function: $p(x) = (1 - \theta)^x \theta$ for $x = 0, 1, 2, \ldots$.
mean: $(1 - \theta)/\theta$.
variance: $(1 - \theta)/\theta^2$.
moment-generating function: $m(t) = \theta (1 - (1 - \theta)e^t)^{-1}$ for $t < -\ln(1 - \theta)$.

4. Hypergeometric$(N, M, n)$, $M \le N$, $n \le N$ all positive integers.

probability function:
$p(x) = \binom{M}{x} \binom{N-M}{n-x} \Big/ \binom{N}{n}$ for $\max(0, n + M - N) \le x \le \min(n, M)$.
mean: $n \frac{M}{N}$.
variance: $n \frac{M}{N} \left(1 - \frac{M}{N}\right) \frac{N-n}{N-1}$.

5. Multinomial$(n, \theta_1, \ldots, \theta_k)$, $n > 0$ an integer, each $\theta_i \in [0, 1]$, $\theta_1 + \cdots + \theta_k = 1$.

probability function:
$p(x_1, \ldots, x_k) = \binom{n}{x_1 \, \ldots \, x_k} \theta_1^{x_1} \cdots \theta_k^{x_k}$ where each $x_i \in \{0, 1, \ldots, n\}$
and $x_1 + \cdots + x_k = n$.
mean: $E(X_i) = n\theta_i$.
variance: $\mathrm{Var}(X_i) = n\theta_i (1 - \theta_i)$.
covariance: $\mathrm{Cov}(X_i, X_j) = -n\theta_i\theta_j$ when $i \ne j$.

6. Negative-Binomial$(r, \theta)$, $r > 0$ an integer, $\theta \in (0, 1]$.

probability function: $p(x) = \binom{r-1+x}{x} \theta^r (1 - \theta)^x$ for $x = 0, 1, 2, 3, \ldots$.
mean: $r(1 - \theta)/\theta$.
variance: $r(1 - \theta)/\theta^2$.
moment-generating function: $m(t) = \theta^r (1 - (1 - \theta)e^t)^{-r}$ for $t < -\ln(1 - \theta)$.

7. Poisson$(\lambda)$, $\lambda > 0$.

probability function: $p(x) = \frac{\lambda^x}{x!} e^{-\lambda}$ for $x = 0, 1, 2, 3, \ldots$.
mean: $\lambda$.
variance: $\lambda$.
moment-generating function: $m(t) = \exp\{\lambda(e^t - 1)\}$ for $t \in R^1$.
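
Because software packages do not always parameterize these distributions as above, it is worth checking a formula against the built-in function before relying on it. The following R sketch compares several of the probability functions listed here with dgeom, dnbinom, dhyper, and dpois; the parameter values are arbitrary choices made for illustration.

```r
x <- 0:5

# Geometric(theta): number of failures before the first success.
theta <- 0.3
geom_formula <- (1 - theta)^x * theta
geom_r <- dgeom(x, prob = theta)

# Negative-Binomial(r, theta): number of failures before the r-th success.
r <- 4
nb_formula <- choose(r - 1 + x, x) * theta^r * (1 - theta)^x
nb_r <- dnbinom(x, size = r, prob = theta)

# Hypergeometric(N, M, n): in R the arguments are M, N - M, and n.
N <- 20; M <- 8; n <- 5
hyp_formula <- choose(M, x) * choose(N - M, n - x) / choose(N, n)
hyp_r <- dhyper(x, m = M, n = N - M, k = n)

# Poisson(lambda).
lambda <- 2.5
pois_formula <- lambda^x * exp(-lambda) / factorial(x)
pois_r <- dpois(x, lambda)

print(cbind(x, geom_formula, geom_r, hyp_formula, hyp_r))
```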


C.2 | Absolutely Continuous Distributions 
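1. Beta$(a, b)$, $a > 0$, $b > 0$ (same as Dirichlet$(a, b)$).
density function: $f(x) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1 - x)^{b-1}$ for $x \in (0, 1)$.
mean: $a/(a + b)$.
variance: $ab/((a + b + 1)(a + b)^2)$.

2. Bivariate Normal$(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$ for $\mu_1, \mu_2 \in R^1$, $\sigma_1^2, \sigma_2^2 > 0$, $\rho \in [-1, 1]$.
density function:

$f_{X_1, X_2}(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2 - 2\rho \left(\frac{x_1 - \mu_1}{\sigma_1}\right) \left(\frac{x_2 - \mu_2}{\sigma_2}\right) \right] \right\}$

for $x_1 \in R^1$, $x_2 \in R^1$.

mean: $E(X_i) = \mu_i$.
variance: $\mathrm{Var}(X_i) = \sigma_i^2$.
covariance: $\mathrm{Cov}(X_1, X_2) = \rho\sigma_1\sigma_2$.

3. Chi-squared$(\alpha)$ or $\chi^2(\alpha)$, $\alpha > 0$ (same as Gamma$(\alpha/2, 1/2)$).
density function: $f(x) = 2^{-\alpha/2} (\Gamma(\alpha/2))^{-1} x^{(\alpha/2)-1} e^{-x/2}$ for $x > 0$.
mean: $\alpha$.
variance: $2\alpha$.
moment-generating function: $m(t) = (1 - 2t)^{-\alpha/2}$ for $t < 1/2$.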


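4. Dirichlet$(\alpha_1, \ldots, \alpha_{k+1})$, $\alpha_i > 0$ for each $i$.
density function:

$f_{X_1, \ldots, X_k}(x_1, \ldots, x_k) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_{k+1})} x_1^{\alpha_1 - 1} \cdots x_k^{\alpha_k - 1} (1 - x_1 - \cdots - x_k)^{\alpha_{k+1} - 1}$

for $x_i \ge 0$, $i = 1, \ldots, k$ and $0 \le x_1 + \cdots + x_k \le 1$.

mean:
$E(X_i) = \frac{\alpha_i}{\alpha_1 + \cdots + \alpha_{k+1}}$.

variance:
$\mathrm{Var}(X_i) = \frac{\alpha_i (\alpha_1 + \cdots + \alpha_{k+1} - \alpha_i)}{(\alpha_1 + \cdots + \alpha_{k+1})^2 (1 + \alpha_1 + \cdots + \alpha_{k+1})}$.

covariance when $i \ne j$:
$\mathrm{Cov}(X_i, X_j) = \frac{-\alpha_i \alpha_j}{(\alpha_1 + \cdots + \alpha_{k+1})^2 (1 + \alpha_1 + \cdots + \alpha_{k+1})}$.

5. Exponential$(\lambda)$, $\lambda > 0$ (same as Gamma$(1, \lambda)$).

density function: $f(x) = \lambda e^{-\lambda x}$ for $x > 0$.
mean: $\lambda^{-1}$.
variance: $\lambda^{-2}$.
moment-generating function: $m(t) = \lambda(\lambda - t)^{-1}$ for $t < \lambda$.

Note that some books and software packages instead replace $\lambda$ by $1/\lambda$ in the definition
of the Exponential$(\lambda)$ distribution; always check this when using another book or
when using software to generate from this distribution.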


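6. $F(\alpha, \beta)$, $\alpha > 0$, $\beta > 0$.
density function:

$f(x) = \frac{\Gamma\left(\frac{\alpha+\beta}{2}\right)}{\Gamma\left(\frac{\alpha}{2}\right)\Gamma\left(\frac{\beta}{2}\right)} \left(\frac{\alpha}{\beta} x\right)^{(\alpha/2)-1} \left(1 + \frac{\alpha}{\beta} x\right)^{-(\alpha+\beta)/2} \frac{\alpha}{\beta}$ for $x > 0$.

mean: $\beta/(\beta - 2)$ when $\beta > 2$.
variance: $2\beta^2(\alpha + \beta - 2)/(\alpha(\beta - 2)^2(\beta - 4))$ when $\beta > 4$.

7. Gamma$(\alpha, \lambda)$, $\alpha > 0$, $\lambda > 0$.

density function: $f(x) = \frac{\lambda^\alpha x^{\alpha-1}}{\Gamma(\alpha)} e^{-\lambda x}$ for $x > 0$.
mean: $\alpha/\lambda$.
variance: $\alpha/\lambda^2$.
moment-generating function: $m(t) = \lambda^\alpha (\lambda - t)^{-\alpha}$ for $t < \lambda$.

Note that some books and software packages instead replace $\lambda$ by $1/\lambda$ in the definition
of the Gamma$(\alpha, \lambda)$ distribution; always check this when using another book or
when using software to generate from this distribution.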


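8. Lognormal or log $N(\mu, \sigma^2)$, $\mu \in R^1$, $\sigma^2 > 0$.

density function: $f(x) = (2\pi\sigma^2)^{-1/2} x^{-1} \exp\left(-\frac{1}{2\sigma^2}(\ln x - \mu)^2\right)$ for $x > 0$.
mean: $\exp(\mu + \sigma^2/2)$.
variance: $\exp(2\mu + \sigma^2)(\exp(\sigma^2) - 1)$.

9. $N(\mu, \sigma^2)$, $\mu \in R^1$, $\sigma^2 > 0$.

density function: $f(x) = (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)$ for $x \in R^1$.
mean: $\mu$.
variance: $\sigma^2$.
moment-generating function: $m(t) = \exp(\mu t + \sigma^2 t^2/2)$ for $t \in R^1$.

10. Student$(\alpha)$ or $t(\alpha)$, $\alpha > 0$ ($\alpha = 1$ gives the Cauchy distribution).

density function:

$f(x) = \frac{\Gamma\left(\frac{\alpha+1}{2}\right)}{\Gamma\left(\frac{\alpha}{2}\right)} \left(1 + \frac{x^2}{\alpha}\right)^{-(\alpha+1)/2} \frac{1}{\sqrt{\alpha\pi}}$ for $x \in R^1$.

mean: 0 when $\alpha > 1$.
variance: $\alpha/(\alpha - 2)$ when $\alpha > 2$.

11. Uniform$[L, R]$, $R > L$.

density function: $f(x) = 1/(R - L)$ for $L < x < R$.
mean: $(L + R)/2$.
variance: $(R - L)^2/12$.
moment-generating function: $m(t) = (e^{Rt} - e^{Lt})/(t(R - L))$.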


Appendix D 
Tables 


The following tables can be used for various computations. It is recommended, how- 
ever, that the reader become familiar with the use of a statistical software package 
instead of relying on the tables. Computations of a much greater variety and accuracy 
can be carried out using the software, and, in the end, it is much more convenient. 


D.1 | Random Numbers


Each line in Table D.1 is a sample of 40 random digits, i.e., 40 independent and identi-
cally distributed (i.i.d.) values from the uniform distribution on the set {0, 1, 2, 3, 4, 5,
6, 7, 8, 9}.

Suppose we want a sample of five i.i.d. values from the uniform distribution on 
S = {1,2,...,25}, i.e., a random sample of five, with replacement, from S. To do 
this, pick a starting point in the table and start reading off successive (nonoverlapping) 
two-digit numbers, treating a pair such as 07 as 7, and discarding any pairs that are not 
in the range 1 to 25, until you have five values. For example, if we start at line 110, we 
read the pairs (* indicates a sample element) 38, 44, 84, 87, 89, 18*, 33, 82, 46, 97, 39, 
36, 44, 20*, 06*, 76, 68, 80, 87, 08*, 81, 48, 66, 94, 87, 60, 51, 30, 92, 97, 00, 41, 27,
12*. We can see at this point that we have a sample of five given by 18, 20, 6, 8, 12. 

If we want a random sample of five, without replacement, from S, then we proceed 
as above but now ignore any repeats in the generated sample until we get the five 
numbers. In the preceding case, we did not get any repeats, so this is also a simple
random sample of size five without replacement. 
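
The table-reading procedure can be written out as a short R sketch. The pairs below are those listed in the text for line 110; the helper pick_from_pairs is a hypothetical name introduced here for illustration.

```r
# Two-digit pairs read from line 110 of Table D.1, as listed in the text.
pairs <- c(38, 44, 84, 87, 89, 18, 33, 82, 46, 97, 39, 36, 44, 20, 6, 76,
           68, 80, 87, 8, 81, 48, 66, 94, 87, 60, 51, 30, 92, 97, 0, 41,
           27, 12)

# pick_from_pairs (hypothetical helper): keep the pairs lying in lo..hi,
# in the order read, dropping repeats when sampling without replacement,
# and return at most `size` of them.
pick_from_pairs <- function(pairs, lo, hi, size, replace = TRUE) {
  keep <- pairs[pairs >= lo & pairs <= hi]
  if (!replace) keep <- unique(keep)
  keep[seq_len(min(size, length(keep)))]
}

with_repl    <- pick_from_pairs(pairs, 1, 25, 5)
without_repl <- pick_from_pairs(pairs, 1, 25, 5, replace = FALSE)
print(with_repl)

# With software, no table is needed at all.
set.seed(110)
s <- sample(1:25, 5, replace = TRUE)
```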


Table D.1 Random Numbers 




D.2 | Standard Normal Cdf 


If $Z \sim N(0, 1)$, then we can use Table D.2 to compute the cumulative distribution
function (cdf) $\Phi$ for Z. For example, suppose we want to compute $\Phi(1.03) = P(Z \le 1.03)$.
The symmetry of the N(0, 1) distribution about 0 implies that $\Phi(z) = 1 - \Phi(-z)$,
so using Table D.2, we have that $P(Z \le 1.03) = 1 - P(Z \le -1.03) = 1 - 0.1515 = 0.8485$.
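
In R, the same computation uses the pnorm function, which returns the standard normal cdf by default. A minimal sketch:

```r
# Standard normal cdf at 1.03 and at -1.03.
p_upper <- pnorm(1.03)
p_lower <- pnorm(-1.03)

# Symmetry about 0: Phi(z) = 1 - Phi(-z).
symmetry_gap <- abs(p_upper - (1 - p_lower))

print(round(c(p_lower, p_upper), 4))
```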


Table D.2 Standard Normal Cdf 
z .00 0l .02 03 




D.3 | Chi-Squared Distribution Quantiles 


If $X \sim \chi^2(df)$, then we can use Table D.3 to obtain some quantiles for this distribution.
For example, if df = 10 and P = 0.98, then $x_{0.98} = 21.16$ is the 0.98 quantile of this
distribution.


Table D.3 $\chi^2(df)$ Quantiles
                            P
df      0.75    0.85    0.90    0.95   0.975    0.98
1       1.32    2.07    2.71    3.84    5.02    5.41
2       2.77    3.79    4.61    5.99    7.38    7.82
3       4.11    5.32    6.25    7.81    9.35    9.84
4       5.39    6.74    7.78    9.49   11.14   11.67
5       6.63    8.12    9.24   11.07   12.83   13.39
6       7.84    9.45   10.64   12.59   14.45   15.03
7       9.04   10.75   12.02   14.07   16.01   16.62
8      10.22   12.03   13.36   15.51   17.53   18.17
9      11.39   13.29   14.68   16.92   19.02   19.68
10     12.55   14.53   15.99   18.31   20.48   21.16
11     13.70   15.77   17.28   19.68   21.92   22.62
12     14.85   16.99   18.55   21.03   23.34   24.05
13     15.98   18.20   19.81   22.36   24.74   25.47
14     17.12   19.41   21.06   23.68   26.12   26.87
15     18.25   20.60   22.31   25.00   27.49   28.26
16     19.37   21.79   23.54   26.30   28.85   29.63
17     20.49   22.98   24.77   27.59   30.19   31.00
18     21.60   24.16   25.99   28.87   31.53   32.35
19     22.72   25.33   27.20   30.14   32.85   33.69
20     23.83   26.50   28.41   31.41   34.17   35.02
21     24.93   27.66   29.62   32.67   35.48   36.34
22     26.04   28.82   30.81   33.92   36.78   37.66
23     27.14   29.98   32.01   35.17   38.08   38.97
24     28.24   31.13   33.20   36.42   39.36   40.27
25     29.34   32.28   34.38   37.65   40.65   41.57
26     30.43   33.43   35.56   38.89   41.92   42.86
27     31.53   34.57   36.74   40.11   43.19   44.14
28     32.62   35.71   37.92   41.34   44.46   45.42
29     33.71   36.85   39.09   42.56   45.72   46.69
30     34.80   37.99   40.26   43.77   46.98   47.96
40     45.62   49.24   51.81   55.76   59.34   60.44
50     56.33   60.35   63.17   67.50   71.42   72.61
60     66.98   71.34   74.40   79.08   83.30   84.58
80     88.13   93.11   96.58   101.9   106.6   108.1
100    109.1   114.7   118.5   124.3   129.6   131.1

(Columns for P = 0.99, 0.995, 0.9975, and 0.999: entries missing.)

D.4 | t Distribution Quantiles

Table D.4 contains some quantiles for t or Student distributions. For example, if $X \sim t(df)$,
with df = 10 and P = 0.98, then $x_{0.98} = 2.359$ is the 0.98 quantile of the t(10)
distribution. Recall that the t(df) distribution is symmetric about 0 so, for example,
$x_{0.25} = -x_{0.75}$.


Table D.4 t (d f) Quantiles 


P 
df 0.75 0.85 0.90 0.95 0.975 0.98 0.99 0.995 0.9975 0.999 


1 
2 
3 
4 
5 
6 
7 
8 
9 


50% 70% 80% 90% 95% 96% 98% 99% 99.5% 99.8% 
Confidence level 


D.5 | F Distribution Quantiles


If X ~ F(ndf,ddf), then we can use Table D.5 to obtain some quantiles for this 
distribution. For example, if ndf = 3,ddf = 4, and P = 0.975, then x0.975 = 9.98 
is the 0.975 quantile of the F (3, 4) distribution. Note that if X ~ F(ndf, ddf), then 
Y =1/X ~ F(ddf,ndf) and P(X < x) = P(Y > 1/x). 
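
The quantiles quoted as examples for Tables D.3, D.4, and D.5 can be obtained in R with the functions qchisq, qt, and qf. The sketch below also illustrates the reciprocal relation between F(ndf, ddf) and F(ddf, ndf) stated above.

```r
# 0.98 quantile of the chi-squared(10) distribution (Table D.3 example).
q_chisq <- qchisq(0.98, df = 10)

# 0.98 quantile of the t(10) distribution (Table D.4 example),
# and the symmetry x_{0.25} = -x_{0.75}.
q_t <- qt(0.98, df = 10)
t_symmetry_gap <- abs(qt(0.25, df = 10) + qt(0.75, df = 10))

# 0.975 quantile of the F(3, 4) distribution (Table D.5 example).
q_f <- qf(0.975, df1 = 3, df2 = 4)

# If X ~ F(ndf, ddf), then 1/X ~ F(ddf, ndf), so the P quantile of F(3, 4)
# is the reciprocal of the 1 - P quantile of F(4, 3).
q_f_recip <- 1 / qf(0.025, df1 = 4, df2 = 3)

print(round(c(q_chisq, q_t, q_f, q_f_recip), 3))
```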


Table D.5 F(ndf, ddf) Quantiles
                                      ndf
ddf  P              1          2          3          4          5          6
1    0.900      39.86      49.50      53.59      55.83      57.24      58.20
     0.950     161.45     199.50     215.71     224.58     230.16     233.99
     0.975     647.79     799.50     864.16     899.58     921.85     937.11
     0.990    4052.18    4999.50    5403.35    5624.58    5763.65    5858.99
     0.999  405284.07  499999.50  540379.20  562499.58  576404.56  585937.11
2    0.900       8.53       9.00       9.16       9.24       9.29       9.33
     0.950      18.51      19.00      19.16      19.25      19.30      19.33
     0.975      38.51      39.00      39.17      39.25      39.30      39.33
     0.990      98.50      99.00      99.17      99.25      99.30      99.33
     0.999     998.50     999.00     999.17     999.25     999.30     999.33
3    0.900       5.54       5.46       5.39       5.34       5.31       5.28
     0.950      10.13       9.55       9.28       9.12       9.01       8.94
     0.975      17.44      16.04      15.44      15.10      14.88      14.73
     0.990      34.12      30.82      29.46      28.71      28.24      27.91
     0.999     167.03     148.50     141.11     137.10     134.58     132.85
4    0.900       4.54       4.32       4.19       4.11       4.05       4.01
     0.950       7.71       6.94       6.59       6.39       6.26       6.16
     0.975      12.22      10.65       9.98       9.60       9.36       9.20
     0.990      21.20      18.00      16.69      15.98      15.52      15.21
     0.999      74.14      61.25      56.18      53.44      51.71      50.53
5    0.900       4.06       3.78       3.62       3.52       3.45       3.40
     0.950       6.61       5.79       5.41       5.19       5.05       4.95
     0.975      10.01       8.43       7.76       7.39       7.15       6.98
     0.990      16.26      13.27      12.06      11.39      10.97      10.67
     0.999      47.18      37.12      33.20      31.09      29.75      28.83
6    0.900       3.78       3.46       3.29       3.18       3.11       3.05
     0.950       5.99       5.14       4.76       4.53       4.39       4.28
     0.975       8.81       7.26       6.60       6.23       5.99       5.82
     0.990      13.75      10.92       9.78       9.15       8.75       8.47
     0.999      35.51      27.00      23.70      21.92      20.80      20.03
7    0.900       3.59       3.26       3.07       2.96       2.88       2.83
     0.950       5.59       4.74       4.35                  3.97       3.87
     0.975       8.07       6.54       5.89                  5.29       5.12
     0.990      12.25       9.55       8.45                  7.46       7.19
     0.999      29.25      21.69      18.77                 16.21      15.52


Table D.5 F(ndf, ddf) Quantiles (continued)
                                      ndf
ddf  P              7          8          9         10         11         12
1    0.900      58.91      59.44      59.86      60.19      60.47      60.71
     0.950     236.77     238.88     240.54     241.88     242.98     243.91
     0.975     948.22     956.66     963.28     968.63     973.03     976.71
     0.990    5928.36    5981.07    6022.47    6055.85    6083.32    6106.32
     0.999  592873.29  598144.16  602283.99  605620.97  608367.68  610667.82
2    0.900       9.35       9.37       9.38       9.39       9.40       9.41
     0.950      19.35      19.37      19.38      19.40      19.40      19.41
     0.975      39.36      39.37      39.39      39.40      39.41      39.41
     0.990      99.36      99.37      99.39      99.40      99.41      99.42
     0.999     999.36     999.37     999.39     999.40     999.41     999.42
3    0.900       5.27       5.25       5.24       5.23       5.22       5.22
     0.950       8.89       8.85       8.81       8.79       8.76       8.74
     0.975      14.62      14.54      14.47      14.42      14.37      14.34
     0.990      27.67      27.49      27.35      27.23      27.13      27.05
     0.999     131.58     130.62     129.86     129.25     128.74     128.32
4    0.900       3.98       3.95       3.94       3.92       3.91       3.90
     0.950       6.09       6.04       6.00       5.96       5.94       5.91
     0.975       9.07       8.98       8.90       8.84       8.79       8.75
     0.990      14.98      14.80      14.66      14.55      14.45      14.37
     0.999      49.66      49.00      48.47      48.05      47.70      47.41
5    0.900       3.37       3.34       3.32       3.30       3.28       3.27
     0.950       4.88       4.82       4.77       4.74       4.70       4.68
     0.975       6.85       6.76       6.68       6.62       6.57       6.52
     0.990      10.46      10.29      10.16      10.05       9.96       9.89
     0.999      28.16      27.65      27.24      26.92      26.65      26.42
6    0.900       3.01       2.98       2.96       2.94       2.92       2.90
     0.950       4.21       4.15       4.10       4.06       4.03       4.00
     0.975       5.70       5.60       5.52       5.46       5.41       5.37
     0.990       8.26       8.10       7.98       7.87       7.79       7.72
     0.999      19.46      19.03      18.69      18.41      18.18      17.99
7    0.900       2.78       2.75       2.72       2.70       2.68       2.67
     0.950       3.79       3.73       3.68       3.64       3.60       3.57
     0.975       4.99       4.90       4.82       4.76       4.71       4.67
     0.990       6.99       6.84       6.72       6.62       6.54       6.47
     0.999      15.02      14.63      14.33      14.08      13.88      13.71




Table D.5 F(ndf, ddf) Quantiles (continued) 

ddf P 15 20 30 60 120 10000 
61.22 61.74 62.26 62.79 63.06 63.32 
245.95 248.01 250.10 252.20 253.25 254.30 
984.87 993.10 1001.41 1009.80 1014.02 1018.21 
6157.28 6208.73 6260.65 6313.03 6339.39 6365.55 
615763.66 620907.67 626098.96  631336.56 633972.40  636587.61 
9.42 9.44 9.46 9.47 9.48 9.49 
19.43 19.45 19.46 19.48 19.49 19.50 
39.43 39.45 39.46 39.48 39.49 39.50 
99.43 99.45 99.47 99.48 99.49 99.50 
999.43 999.45 999.47 999.48 999.49 999.50 
5.20 5.18 5.17 5.15 5.14 5.13 
8.70 8.66 8.62 8.57 8.55 8.53 
14.25 14.17 14.08 13.99 13.95 13.90 
26.87 26.69 26.50 26.32 26.22 26.13 
127.37 126.42 125.45 124.47 123.97 123.48 
3.87 3.84 3.82 3.79 3.78 3.76 
5.86 5.80 5.75 5.69 5.66 5.63 
8.66 8.56 8.46 8.36 8.31 8.26 
14.20 14.02 13.84 13.65 13.56 13.46 
46.76 46.10 45.43 44.75 44.40 44.06 
3.24 3.21 3.17 3.14 3.12 3.11 
4.62 4.56 4.50 4.43 4.40 4.37 
6.43 6.33 6.23 6.12 6.07 6.02 
9.72 9.55 9.38 9.20 9.11 9.02 
25.91 25.39 24.87 24.33 24.06 23.79 
2.87 2.84 2.80 2.76 2.74 2.72 
3.94 3.87 3.81 3.74 3.70 3.67 
5.27 5.17 5.07 4.96 4.90 4.85 
7.56 7.40 7.23 7.06 6.97 6.88 
17.56 17.12 16.67 16.21 15.98 15.75 
2.63 2.59 2.56 2.51 2.49 2.47 
3.51 3.44 3.38 3.30 3.27 3.23 
4.57 4.47 4.36 4.25 4.20 4.14 
6.31 6.16 5.99 5.82 5.74 5.65 
13.32 12.93 12.53 12.12 11.91 11.70 




Table D.5 F(ndf, ddf) Quantiles (continued) 


2.73 2.67 

3.69 3.58 

4.82 4.65 

6.63 6.37 

14.39 13.48 12.86 
2.69 2.61 2.55 

3.63 3.48 3.37 

4.72 4.48 4.32 

6.42 6.06 5.80 

12.56 11.71 11.13 
2.61 2.52 2.46 

3.48 3.33 3.22 

4.47 4.24 4.07 

5.99 5.64 5.39 

11.28 10.48 9.93 
2.54 2.45 2.39 

3.36 3.20 3.09 

4.28 4.04 3.88 

5.67 5.32 5.07 

10.35 9.58 9.05 
2.48 2.39 2.33 

3.26 3.11 3.00 

4.12 3.89 3.73 

5.41 5.06 4.82 

9.63 8.89 8.38 

2.43 2.35 2.28 

3.18 3.03 2.92 

4.00 3.77 3.60 

5.21 4.86 4.62 

9.07 8.35 7.86 

2.39 2.31 2.24 

3.11 2.96 2.85 

3.89 3.66 3.50 

5.04 4.69 4.46 

8.62 7.92 7.44 


Table D.5 F(ndf, ddf) Quantiles (continued) 


D.6 Binomial Distribution Probabilities 


If X ~ Binomial(n, p) , then Table D.6 contains entries computing 


$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$ 


for various values of n, k, and p. 
Note that if X ~ Binomial(n, p), then P(X = k) = P(Y = n — k), where 
Y =n — X ~ Binomial(n, 1 — p). 
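
The two statements above can be checked numerically. The following is a minimal sketch, not part of the original tables; the function name `binomial_pmf` and the test values n = 10, p = 0.3 are illustrative assumptions.

```python
from math import comb


def binomial_pmf(n: int, k: int, p: float) -> float:
    """Return P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Symmetry used to extend the table to p > 1/2:
# P(X = k) = P(Y = n - k) with Y = n - X ~ Binomial(n, 1 - p).
n, p = 10, 0.3  # illustrative values, not taken from the text
for k in range(n + 1):
    left = binomial_pmf(n, k, p)
    right = binomial_pmf(n, n - k, 1 - p)
    assert abs(left - right) < 1e-12

# One table-style entry: n = 2, k = 0, p = 0.01
entry = round(binomial_pmf(2, 0, 0.01), 4)
print(entry)
```

Because of this symmetry, a table only needs to list values of p up to 1/2.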


Table D.6 Binomial Probabilities 


n k p = .01 .02 .03 .04 .05 .06 .07 .08 .09 
2 0 .9801 .9604 .9409 .9216 .9025 .8836 .8649 .8464 .8281 


Table D.6 Binomial Probabilities (continued) 


Appendix E 


Answers to Odd-Numbered 
Exercises 


Answers are provided here to odd-numbered exercises that require a computation. 
No details of the computations are given. If the Exercise required that something be 
demonstrated, then a significant hint is provided. 

1.2.1 (a) P({1, 2}) = 5/6 (b) P({1, 2, 3}) = 1 (c) P({1}) = P({2, 3}) = 1/2 
1.2.3 P({2}) = 1/6 

1.2.5 P({s}) = 0 for any $s \in [0, 1]$ 

1.2.7 This is the subset $(A \cap B^c) \cup (A^c \cap B)$. 

1.2.9 P({1}) = 1/12, P({2}) = 1/12, P({3}) = 1/6, P({4}) = 2/3 

1.2.11 P({2}) = 5/24, P({1}) = 3/8, P({3}) = 5/12 

1.3.1 (a) P({2, 3, 4, ..., 100}) = 0.9 (b) 0.1 

1.3.3 P (late or early or both) = 25% 

1.3.5 (a) 1/32 = 0.03125. (b) 0.96875 

1.3.7 10% 

1.4.1 (a) $(1/6)^8$ = 1/1,679,616 (b) $(1/6)^7$ = 1/279,936 (c) $8(1/6)^8$ = 1/209,952 
1.4.3 $1 - 5051/2^{100}$ 

1.4.5 (a) (G3) (is i 13)/ (13 313 13) (b) (i) (3) (3) (13 3 3) / (13 313 13) 

1.4.7 (78) / GR) = 246/595 = 0.4134 

1.4.9 $(5/6)^2 (1/6)$ = 25/216 

1411 (8) /()) (0/8) 0/0) (6) 

1.4.13 OA Os Ott OR O - G) ae = tap = 0.0859 

1.5.1 (a) 3/4 (b) 16/21 

1.5.3 (a) 1/8 (b) (1/8)/ (1/2) = 1/4 (c) 0/(1/2) = 0 

1.5.5 1 

1.5.7 0.074 






1.5.9 (a) No (b) Yes (c) Yes (d) Yes (e) No 

1.5.11 (a) 0.1667 (b) 0.3125 

1.6.1 1/3 

1.6.3 $\{A_n\} \nearrow A = \{1, 2, 3, \ldots\} = S$ 

1.6.5 1 

1.6.7 Suppose there is no n such that P([0, n]) > 0.9 and then note this implies 
$1 = P([0, \infty)) = \lim_{n \to \infty} P([0, n]) \le 0.9$. 

1.6.9 No 

2.1.1 (a) 1 (b) Does not exist (c) Does not exist (d) 1 

2.1.3 (a) $X(s) = s$ and $Y(s) = s^2$ for all $s \in S$. (b) For this example, Z(1) = 
2, Z(2) = 18, Z(3) = 84, Z(4) = 260, Z(5) = 630. 

2.1.5 Yes, for $A \cap B$. 

2.1.7 (a) W(1) = 2 (b) W (2) = 0 (c) W(3) = —1 (d) W > Z is not true. 

2.1.9 (a) Y(1) = 1 (b) Y(2) = 4 (c) Y(4) = 0 

2.2.1 P(X = 0) = P(X = 2) = 1/4, P(X = 1) = 1/2, P(X = x) = 0 for 
$x \ne 0, 1, 2$ 

2.2.3 (a) P(Y = y) = 0 for $y \ne 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12$, P(Y = 2) = 
1/36, P(Y = 3) = 2/36, P(Y = 4) = 3/36, P(Y = 5) = 4/36, P(Y = 6) = 
5/36, P(Y = 7) = 6/36, P(Y = 8) = 5/36, P(Y = 9) = 4/36, P(Y = 10) = 
3/36, P(Y = 11) = 2/36, P(Y = 12) = 1/36 (b) $P(Y \in B) = (1/36)I_B(2) + 
(2/36)I_B(3) + (3/36)I_B(4) + (4/36)I_B(5) + (5/36)I_B(6) + (6/36)I_B(7) + (5/36)I_B(8) + 
(4/36)I_B(9) + (3/36)I_B(10) + (2/36)I_B(11) + (1/36)I_B(12)$ 

2.2.5 (a) P(X = 1) = 0.3, P(X = 2) = 0.2, P(X = 3) = 0.5, and P(X = x) = 0 
for all $x \notin \{1, 2, 3\}$ (b) P(Y = 1) = 0.3, P(Y = 2) = 0.2, P(Y = 3) = 0.5, and 
P(Y = y) = 0 for all $y \notin \{1, 2, 3\}$ (c) P(W = 2) = 0.09, P(W = 3) = 0.12, 
P(W = 4) = 0.34, P(W = 5) = 0.2, P(W = 6) = 0.25, and P(W = w) = 0 for all 
other choices of w. 

2.2.7 P(X = 25) = 0.45, P(X = 30) = 0.55, and P(X = x) = 0 otherwise 

2.3.1 py (2) = 1/36, py (3) = 2/36, py (4) = 3/36, py (5) = 4/36, py (6) = 5/36, 
Py (7) = 6/36, py (8) = 5/36, py (9) = 4/36, py (10) = 3/36, py (11) = 2/36, 
py (12) = 1/36, and py (y) = 0 otherwise 

2.3.3 pz(1) = pz(5) = 1/4, pz(0) = 1/2, and pz(z) = 0 otherwise 

2.3.5 pw(1) = 1/36, pw(2) = 2/36, pw(3) = 2/36, pw(4) = 3/36, pw(5) = 
2/36, pw(6) = 4/36, pw(8) = 2/36, pw(9) = 1/36, pw(10) = 2/36, pw(12) = 
4/36, pw(15) = 2/36, pw(16) = 1/36, pw(18) = 2/36, pw(20) = 2/36, pw(24) = 
2/36, pw(25) = 1/36, pw(30) = 2/36, and pw(36) = 1/36, with pw(w) = 0 
otherwise 
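
The probabilities listed for 2.3.5 are consistent with W being the product of two independent fair dice. That reading is an assumption here, since the exercise statement is not reproduced in this appendix. Under that assumption, a short enumeration sketch recovers the same probability function exactly.

```python
from collections import Counter
from fractions import Fraction

# Assumed setup: W = X * Y for two independent fair six-sided dice.
pmf = Counter()
for x in range(1, 7):
    for y in range(1, 7):
        pmf[x * y] += Fraction(1, 36)

# Print the probability function in increasing order of w.
for w in sorted(pmf):
    print(w, pmf[w])
```

Exact fractions avoid rounding, so the entries can be compared directly with the listed values (for example, 4/36 appears in reduced form as 1/9).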

2.3.7 $\theta = 11/12$ 

2.3.9 53/512 

2.3.11 61° 

2.3.15 (a) (')) (0.35)? (0.65)7 (b) (0.35) (0.65)? (c) (?) (0.35)? (0.65)8 




2.3.17 (a) Hypergeometric(9, 4, 2) (b) Hypergeometric(9, 5, 2) 

2.3.19 $P(X = 5) \approx ((100/1000)^5/5!) \exp\{-100/1000\}$ 

2.4.1 (a) 0 (b) 0 (c) 0 (d) 2/3 (e) 2/3 (f) 1 (g) = 1 

2.4.3 (a) e~29 (b) 1 (c) e7! (d) e7425" 

2.4.5 No 

2.4.7 c = 3/M? 

2.4.9 [> f@)dx > f? g&æ)dx 

2.4.11 Yes 

2.4.13 $P(Y < 3) = \int_{-\infty}^{3} (2\pi)^{-1/2} \exp(-(y - 1)^2/2)\, dy = \int_{-\infty}^{2} (2\pi)^{-1/2} \exp(-u^2/2)\, du = P(X < 2)$ 


2.5.1 Properties (a) and (b) follow by inspection. Properties (c) and (d) follow since 
Fyx(x) = 0 for x < 1/6, and Fy (x) = 1 for x > 1. 


2.5.3 (a) No (b) Yes (c) Yes (d) No (e) Yes (f) Yes (g) No 

2.5.5 Hence: (a) 0.933 (b) 0.00135 (c) $1.90 \times 10^{-8}$ 

2.5.7 (a) 1/9 (b) 3/16 (c) 12/25 (d) 0 (e) 1 (f) 0 (g) 1 (h) 0 

2.5.9 (b) No 

2.5.11 (b) Yes 

2.5.13 (b) The function F is nondecreasing, $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$. 
(c) P(X > 4/5) = 0, P(-1 < X < 1/2) = 3/4, P(X = 2/5) = 5/12, P(X = 
4/5) = 1/4 

2.5.15 (a) $P(Z > 4/5) = 2e^{-16/25}/3$ (b) $P(-1 < Z < 1/2) = 11/12 - 2e^{-1/2}/3$ (c) 
P(Z = 2/5) = 5/36 (d) P(Z = 4/5) = 1/12 (e) P(Z = 0) = 1/9 (f) $P(Z = 1/2) = 
11/12 - 2e^{-1/2}/3$ 

2.6.1 fy (y) equals 1/(R — L)c for L < (y — d)/c < R and otherwise equals 0. 

2.6.3 fy(y) = eden) /20°0* tog Iq 

2.6.5 fy (y) equals (4/3)y 723 e749"? for y > 0 and otherwise equals 0. 

2.6.7 $f_Y(y) = 1/(6y^{1/2})$ for 0 < y < 9 

2.6.9 (a) fy (y) = y/8 b) fz(z) =z! /2 for 0 < z < V2 

2.6.11 $f_Y(y) = y^{-1/2} \sin(y^{1/2})/4$ for $0 < y < \pi^2$ and 0 otherwise 

2.6.13 fy O) = 22) Bly exp ly? /2) 


2.7.1 
$F_{X,Y}(x, y) = 0$ if $\min[x, (y + 2)/4] < 0$, 
$F_{X,Y}(x, y) = 1/3$ if $0 \le \min[x, (y + 2)/4] < 1$, 
$F_{X,Y}(x, y) = 1$ if $\min[x, (y + 2)/4] \ge 1$ 


2.7.3 (a) px (2) = px(3) = px(—3) = px(—2) = px(17) = 1/5, with px(x) = 0 
otherwise (b) py (3) = py (2) = py(—2) = py(—3) = py (19) = 1/5, with py (y) = 
0 otherwise (c) P(Y > X) = 3/5 (d) P(Y = X) = 0 (e) P(XY <0) = 0 


2.7.5 $\{X \le x, Y \le y\} \subseteq \{X \le x\}$ and $\{X \le x, Y \le y\} \subseteq \{Y \le y\}$ 




2.7.7 (a) $f_X(x) = c(1 - \cos(2x))/x$ for 0 < x < 1 and 0 otherwise (b) $f_Y(y) = 
c(1 - \cos(y))/y$ for 0 < y < 2 and 0 otherwise 

2.7.9 (a) $f_X(x) = (4 + 3x^2 - 2x^3)/8$ for $x \in (0, 2)$ and 0 otherwise (b) $f_Y(y) = 
(y^3 + 3y^2)/12$ for $y \in (0, 2)$ and 0 otherwise (c) P(Y < 1) = 5/48 
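
The marginals in 2.7.9 are consistent with a joint density of (x^2 + y)/4 on 0 < x < y < 2. That joint density is inferred from the answers here and in 2.8.15, not quoted from the exercise, so treat it as an assumption. Under it, a midpoint-rule sketch reproduces the three results; the helper names are illustrative.

```python
def integrate(f, a, b, n=400):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def joint(x, y):
    """Assumed joint density on the region 0 < x < y < 2."""
    return (x * x + y) / 4


def f_X(x):
    # Integrate out y over x < y < 2.
    return integrate(lambda y: joint(x, y), x, 2)


def f_Y(y):
    # Integrate out x over 0 < x < y.
    return integrate(lambda x: joint(x, y), 0, y)


prob_Y_below_1 = integrate(f_Y, 0, 1)
print(f_X(1.0), f_Y(1.0), prob_Y_below_1)
```

The printed values can be compared with (4 + 3 - 2)/8, (1 + 3)/12, and 5/48 respectively.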

2.8.1 (a) px(—2) = 1/4, px) = 1/4, px(13) = 1/2; otherwise, px(x) = 0 (b) 
Py (3) = 2/3, py(5) = 1/3; otherwise, py (y) = 0 (c) Yes 

2.8.3 (a) $f_X(x) = (18x/49) + (40/49)$ for 0 < x < 1 and $f_X(x) = 0$ otherwise (b) 
$f_Y(y) = (48y^2 + 6y + 30)/49$ for 0 < y < 1 and $f_Y(y) = 0$ otherwise (c) No 

2.8.5 (a) P(Y = 4 | X = 9) = 1/6 (b) P(Y = -2 | X = 9) = 1/2 (c) P(Y = 0 | X = 
-4) = 0 (d) P(Y = -2 | X = 5) = 1 (e) P(X = 5 | Y = -2) = 1/3 

2.8.7 (a) fx(x) = x? + 2/3, fy (y) = 4y% + 2y/3 forO < x < landO < y < 1, 
frixO |x) = (2x?y + 4y°) / (x? + 2/3) (otherwise, fyjx(y |x) = 0), thus, X and 
Y are not independent. (b) fx(x) = C(x°/6 + x/2), fy (y) = C(y5/6 + y/2) for 
O<x <land0 < y <1, fyxOlx) = (xy + x5y5)/(x5/6 + x/2) (other- 
wise, fy|x(y|x) = 0) X and Y are not independent. (c) fy(x) = C (500,000x5/3 + 
50x), fy) = C(2048y>/3 + 8y) forO < x < 4and 0 < y < 10, frixQ |x) = 
(xy +x5y5) / (500,000x° /3 + 50x) (otherwise, fy|x © | x) = 0), thus, X and Y are not 
independent. (d) fx (x) = C(500,000x°/3) and fy(y) = C(2048y°/3) for0 < x <4 
and 0 < y < 10, fyjx(y |x) = 3y> / 500,000 (otherwise, fy\x(y |x) = 0), X and Y 
are independent. 

2.8.9 P(X =1,Y=)= P(X = 1, Y =2)= P(X =2, Y=) = P(X =3, Y= 
3) = 1/4 

2.8.11 If X = C is constant, then $P(X \in B_1) = I_{B_1}(C)$ and $P(X \in B_1, Y \in B_2) = 
I_{B_1}(C) P(Y \in B_2)$. 


2.8.13 (a) 
$p_{Y|X}(y \mid x)$ | y=1 y=2 y=4 y=7 Others 
x=3 | 1/4 1/4 1/4 1/4 0 
x=5 | 1/4 1/4 1/4 1/4 0 
(b) 
$p_{X|Y}(x \mid y)$ | x=3 x=5 Others 
y=1 | 1/2 1/2 0 
y=2 | 1/2 1/2 0 
y=4 | 1/2 1/2 0 
y=7 | 1/2 1/2 0 
(c) X and Y are independent. 


2.8.15 (a) $f_{Y|X}(y \mid x) = 2(x^2 + y)/(4 + 3x^2 - 2x^3)$ for x < y < 2, and 0 otherwise (b) 
$f_{X|Y}(x \mid y) = 3(x^2 + y)/(y^3 + 3y^2)$ for 0 < x < y and 0 otherwise (c) Not independent 


2.9.1 EL = — cos(2x u2) / u1 y/2log(1/u1), St = —2m sin(2xu2),/2 log(1/u1), 
Z = — sin(2mur) / u1 y2 log(1/u1), $2 = —2x cos(2mur) x y/2log(1/u1) 


2.9.3 (b) h(x, y) = x? +y’, x? — y?) CA 1, w) = (VRF w)/2, JE — w)/2), 
at least for z + w > Oandz—w > 0 (d) fz,w (z, w) = e VETO? /2,./2? = w? for 




V(z+w)/2 > Oand 1 < ~y(z — w)/2 < 4, i.e., for z > 4 and max(—z, z — 64) < 
w < z — 4, and 0 otherwise 

2.9.5 (b) h(x, y) = O4, x4) (©) hF, w) = (w4, 2!) @ fz,w E, w) = e™” 
for w!/4 > Oand 1 < z!/4 < 4, i.e., for w > Oand 1 < z < 256, and 0 otherwise 
2.9.7 pz(2) = 1/18, pz(4) = 1/12, pz(5) = 1/18, pz (7) = 1/24, pz(8) = 1/72, 
pzQ) = 1/4, pz(11) = 3/8, pz(12) = 1/8, pz(z) = 0 otherwise 

2.9.9 

(a) 


1/4 


(z, w) | (-8,16) (-7,19) (-3,11) (-2,14) (0,6) otherwise 
P(Z=z, W=w) | 1/5 1/5 1/5 1/5 1/5 0 


(b) pz(z) = 1/5 for z = —8, —7, —3, —2, 0, and otherwise pz(z) = 0 (c) pw(w) = 
1/5 for w = 6, 11, 14, 16, 19, and otherwise pw(w) = 0 

2.10.1 Z = -7 if U < 1/2, Z = -2 if 1/2 < U < 5/6, and Z = 5 if U > 5/6 
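
The rule in 2.10.1 is an instance of generating a discrete random variable from a Uniform[0, 1] variable U. The sketch below simulates it and tabulates relative frequencies; the function name `sample_Z`, the seed, and the sample size are illustrative choices, and boundary values of U (which occur with probability 0) are assigned arbitrarily.

```python
import random
from collections import Counter


def sample_Z(u: float) -> int:
    """Map a Uniform[0, 1] draw u to a value of Z as in 2.10.1."""
    if u <= 1 / 2:
        return -7
    if u <= 5 / 6:
        return -2
    return 5


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 100_000
counts = Counter(sample_Z(rng.random()) for _ in range(n))

# Relative frequencies should be close to 1/2, 1/3, and 1/6.
for z in (-7, -2, 5):
    print(z, counts[z] / n)
```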
2.10.3 Y ~ Exponential (3) 

2.10.5 $c_1 = \pm 3\sqrt{2}$ and $c_2 = 5$ 

2.10.7 (a) For x < 1, $F_X(x) = 0$, for $1 \le x < 2$, $F_X(x) = 1/3$, for $2 \le x < 4$, 
$F_X(x) = 1/2$, for $x \ge 4$, $F_X(x) = 1$. (b) The range of t must be restricted to (0, 1] 
because $F_X^{-1}(0) = -\infty$. $F_X^{-1}(t) = 1$ for $t \in (0, 1/3]$, $F_X^{-1}(t) = 2$ for $t \in (1/3, 1/2]$, 
and $F_X^{-1}(t) = 4$ for $t \in (1/2, 1]$. (c) For y < 1, $F_Y(y) = 0$, for $1 \le y < 2$, 
$F_Y(y) = 1/3$, for $2 \le y < 4$, $F_Y(y) = 1/2$, for $y \ge 4$, $F_Y(y) = 1$. 

2.10.9 $Y = F^{-1}(U) = U^{1/4}$ 

3.1.1 (a) E(X) = 8/7 (b) E(X) = 1 (c) E(X) = 8 

3.1.3 (a) E(X) = -1 (b) E(Y) = 11 (c) $E(X^2) = 19$ (d) $E(Y^2) = 370/3$ (e) $E(X^2 + 
Y^2) = 427/3$ (f) E(XY - 4Y) = -113/2 

3.1.5 $E(8X - Y + 12) = 8((1 - p)/p) - \lambda + 12$ 

3.1.7 E(XY) = 30 

3.1.9 E(X) =6 

3.1.11 (a) E(Z) = 7 (b) E(W) = 49/4 

3.1.13 E(Y) =7/4 

3.2.1 (a) C = 1/4, E(X) = 7 (b) C = 1/16, E(X) = 169/24 (c) C = 5/3093, 
E(X) = —8645/2062 

3.2.3 (a) E(X) = 17/24 (b) E(Y) = 17/8 (c) $E(X^2) = 11/20$ (d) $E(Y^2) = 99/20$ (e) 
$E(Y^4) = 216/7$ (f) $E(X^2 Y^2) = 27/4$ 

3.2.5 E(—5X — 6Y) = —77/3 

3.2.7 E(Y + Z) = 17/72 

3.2.9 Let $\mu_k = E(X^k)$, then $\mu_1 = 39/25$, $\mu_2 = 64/25$, $\mu_3 = 152/35$. 

3.2.11 334 

3.2.13 E(Y) = 214.1 

3.2.15 Yes 




3.3.1 (a) Cov(X, Y) = 2/3 (b) Var(X) = 2, Var(Y) = 32/9 (c) Corr(X, Y) = 1/4 
3.3.3 Corr(X, Y) = —0.18292 

3.3.5 E(XY) = E(X)E(Y) 

3.3.7 (a) Cov(X, Z) = 1/9 (b) Corr(X, Z) = 1/46 

3.3.9 $E(X(X - 1)) = E(X^2) - E(X)$; when $X \sim \text{Binomial}(n, \theta)$, $E(X(X - 1)) = 
n(n - 1)\theta^2$ 
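
The factorial-moment identity in 3.3.9 can be checked by summing directly over the binomial probability function. This is a minimal sketch; the function name and the values n = 12, theta = 0.3 are illustrative assumptions.

```python
from math import comb


def factorial_moment(n: int, theta: float) -> float:
    """Compute E[X(X - 1)] for X ~ Binomial(n, theta) by direct summation."""
    return sum(
        k * (k - 1) * comb(n, k) * theta**k * (1 - theta) ** (n - k)
        for k in range(n + 1)
    )


n, theta = 12, 0.3  # illustrative values
direct = factorial_moment(n, theta)
closed_form = n * (n - 1) * theta**2
print(direct, closed_form)
```

Adding E(X) = n theta to this quantity gives E(X^2), from which the variance n theta (1 - theta) follows.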

3.3.11 E(X) = 7/2, E(Y) = 7, E(XY) = 329/12, Cov(X, Y) = 35/12 

3.3.13 Cov(Z, W) = 0, Corr(Z, W) = 0 

3.3.15 Cov(X, Y) = 35/24 

3.4.1 (a) $r_Z(t) = t/(2 - t)$, $r_Z'(t) = 2/(2 - t)^2$, $r_Z''(t) = -4/(t - 2)^3$ (b) E(Z) = 2, 
Var(Z) = 2, $m_Z(t) = e^t/(2 - e^t)$, $m_Z'(t) = 2e^t/(2 - e^t)^2$, $m_Z''(t) = 2e^t(2 + e^t)/(2 - e^t)^3$ 
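3.4.3 $m_Y(s) = e^{\lambda(e^s - 1)}$, $m_Y'(s) = \lambda e^s e^{\lambda(e^s - 1)}$, so $m_Y'(0) = \lambda$, $m_Y''(s) = (\lambda e^s + 
\lambda^2 e^{2s}) e^{\lambda(e^s - 1)}$, so $m_Y''(0) = \lambda + \lambda^2$, $\mathrm{Var}(Y) = \lambda + \lambda^2 - \lambda^2 = \lambda$ 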

3.4.5 my (s) = em x Bs) 

3.4.7 $m_Y'''(s) = e^{\lambda(e^s - 1)} e^s \lambda(1 + 3e^s\lambda + e^{2s}\lambda^2)$, $E(Y^3) = m_Y'''(0) = \lambda(1 + 3\lambda + \lambda^2)$ 
3.5.1 (a) E(X | Y = 3) = 5/2 (b) E(Y | X = 3) = 22/3 (c) E(X | Y = 2) = 
5/2, E(X | Y = 17) = 3 (d) E(Y | X = 2) = 5/2, E(Y | X = 3) = 22/3 

3.5.3 (a) E(Y |X = 6) = 25/4 (b) E(Y |X = —4) = 36/7 (c) E(Y |X) = 25/4 
whenever X = 6, and E (Y | X) = 36/7 whenever X = —4. 

3.5.7 E(Z|W = 4) = 14/3 (b) E(W|Z = 4) = 10/3 

3.5.9 (a) E(X|Y = 0) = 1 (b) E(X|Y = 1) = 2 (c) E(Y|X = 0) = 0 (d) E(Y|X = 
1) = 1/3 (e) E(Y|X = 2) = 2/3 (f) E(Y|X = 3) = 1 (g) E(Y|X) = X/3 

3.5.11 (a) E(X) = 4 (b) EY) = 2% © E(XIY = y) = 32+ y?)/4 + 3y°) (d) 
E(YIX = x) = (02/2 + 1/5)/G? + 1/4) ©) ELE(XIY)] = h SEP. a+ 
ay dy = H (Ð ELEWIX)] = fo SEES. $0? + Dax = Z. 

3.6.1 3/7 


3.6.3 (a) 1/9 (b) 1/2 (c) 2 (d) The upper bound in part (b) is smaller and thus more 
useful than that in part (c). 


3.6.5 1/4 

3.6.7 (a) 10,000 (b) 12,100 

3.6.9 (a) 1 (b) 1/4 

3.6.11 (a) E(Z) = 8/5 (b) 32/75 

3.6.13 7/16 

3.7.1 E(X1) = 3, E(X2) = 0, E(Y) = 3/5 

3.7.3 P(X < t) = 0 for t < 0, while P(X > t) = 1 for 0 < t < C and P(X > t) = 0 
for t > C 

3.7.5 E(X) =2 

3.7.7 E(W) = 1/5 




3.7.9 E(W) = 21/2 

4.1.1 $P(Y_3 = 1) = 1/8$, $P(Y_3 = 2) = 1/64$, $P(Y_3 = 3) = 1/64$, $P(Y_3 = 2^{1/3}) = 
3/16$, $P(Y_3 = 3^{1/3}) = 3/16$, $P(Y_3 = 4^{1/3}) = 3/32$, $P(Y_3 = 9^{1/3}) = 3/32$, $P(Y_3 = 
12^{1/3}) = 3/64$, $P(Y_3 = 18^{1/3}) = 3/64$, $P(Y_3 = 6^{1/3}) = 3/16$ 

4.1.3 If Z is the sample mean, then $P(Z = 0) = p^2$, $P(Z = 0.5) = 2p(1 - p)$, and 
$P(Z = 1) = (1 - p)^2$. 

4.1.5 For 1 < j < 6, P(max = j) = (j/6)*9 — (Vj — 1)/6)”". 

4.1.7 If W = XY, then 

P(W = w) = 1/36 if w = 1, 9, 16, 25, 36, 
           1/18 if w = 2, 3, 5, 8, 10, 15, 18, 20, 24, 30, 
           1/12 if w = 4, 
           1/9 if w = 6, 12, 
           0 otherwise. 


4.1.9 py (y) = 1/2 for y = 1, 2; otherwise, py (y) = 0 

4.2.1 Note that $Z_n = Z$ unless $7 \le U < 7 + 1/n^2$. Hence, for any $\epsilon > 0$, $P(|Z_n - Z| \ge 
\epsilon) \le P(7 \le U < 7 + 1/n^2) = 1/(5n^2) \to 0$ as $n \to \infty$. 

4.2.3 P(Wi +--+ Wn < n/2) = 1 — P(Wi +-+- + Wn > n/2) > 1— PUL(Wi + 
"+ Wa) — 31 2 1/6) 

4.2.5 $P(X_1 + \cdots + X_n > 9n) \le P(|(X_1 + \cdots + X_n)/n - 8| \ge 1)$ 

4.2.7 For alle > O andn > —2Ine, P (Xn — Y| > €) = P(e7™ > €) = P(H, < 
—Ine) < P(\A, — n/2| > |n/2+Inel)n > œ. 

4.2.9 By definition, H, — 1 < Fy < Hn, and P(|Xn — Yn — Z| > €) = P(|Hn — 
Fal/(Hn + 1) 2 €) < P(/(An + 1) > €) = P(H, < Q/6)— 1) = P(H, —- n/2 < 
(1/e) — 1 —n/2) < P((Hn — n/2| > |1 +n/2—1/e|). 

4.2.11 r = 9/2 

4.3.1 Note that Z, = Z unless 7 < U < 7 + 1/n?. Also, if U > 7, then Z, = Z 
whenever 1/n? < 7 — U, i.e, n > 1//7—U. Hence, P (Z, > Z) > P(U #7). 
4.3.3 (W1 +--+ + Wn)/n > 1/3} © (n; (Wi +- + W,)/n < 1/2} = {A 
n; W +--+ Wn <n/2} 

4.3.5 P(X, > X and Y, > Y) = 1 — P(X, ® Xor Y, HY) > 1— P(X, P 
X)— Pnr Y) 

4.3.7 m = 5 

4.3.9 r = 9/2 


4.3.11 (a) Suppose there is no such m and from this get a contradiction to the strong 
law of large numbers. (b) No 


4.4.1 $\lim_{n \to \infty} P(X_n = i) = 1/3 = P(X = i)$ for i = 1, 2, 3 


4.4.3 Here, P (Z, < 1) = 1, for 0 < z < 1, P(Z, < z) = z"t!, and P(Z < z) = 1 
forz > 1. 


4.4.5 $P(S < 540) \approx \Phi(1/2) = 0.6915$ 
4.4.7 $P(S > 2450) \approx \Phi(-0.51) = 0.3050$ 
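
The two normal-approximation values above use the standard normal cdf. As a sketch, the cdf can be evaluated with the error function from Python's standard library instead of a printed table; the name `Phi` is an illustrative choice.

```python
from math import erf, sqrt


def Phi(z: float) -> float:
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Values used in 4.4.5 and 4.4.7
phi_half = Phi(0.5)
phi_neg = Phi(-0.51)
print(f"{phi_half:.4f} {phi_neg:.4f}")
```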




4.4.9 (a) For 0 < y < 1, $P(Z \le y) = y^2$. (b) For $1 \le m \le n$, $P(X_n \le m/n) = 
m(m + 1)/[n(n + 1)]$. (c) For 0 < y < 1, let $m = \lfloor ny \rfloor$, the biggest integer not greater 
than ny. Since there is no integer in (m, ny), $P(m/n < X_n < y) \le P(m/n < X_n < 
(m + 1)/n) = 0$. Thus, $P(X_n \le y) = m(m + 1)/[n(n + 1)]$, where $m = \lfloor ny \rfloor$. 
(d) For 0 < y < 1, let $m_n = \lfloor ny \rfloor$, show $m_n/n \to y$ as $n \to \infty$. Then show 
$P(X_n \le y) \to y^2$ as $n \to \infty$. 

4.4.11 1=3 

4.4.13 The yearly output, Y, is approximately normally distributed with mean 1300 
and variance 433. So, P(Y < 1280) ~ 0.1685. 

4.5.1 The integral equals $\sqrt{2\pi}\, E(\cos^2(Z))$, where $Z \sim N(0, 1)$. 
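
The representation in 4.5.1 suggests a simple Monte Carlo estimate: average cos^2(Z) over simulated standard normal values and multiply by the square root of 2 pi. The sketch below assumes the underlying integral is that of cos^2(x) exp(-x^2/2) over the real line, which is inferred from the answer rather than quoted from the exercise. The seed and sample size are arbitrary, and the comparison value uses the identity E(cos(2Z)) = exp(-2).

```python
import random
from math import cos, exp, pi, sqrt

rng = random.Random(1)  # fixed seed for reproducibility
n = 200_000

# Sample mean of cos^2(Z) for Z ~ N(0, 1)
mean_cos2 = sum(cos(rng.gauss(0.0, 1.0)) ** 2 for _ in range(n)) / n
estimate = sqrt(2 * pi) * mean_cos2

# Closed form for comparison: E[cos^2 Z] = (1 + E[cos 2Z]) / 2 = (1 + e^{-2}) / 2
exact = sqrt(2 * pi) * (1 + exp(-2)) / 2
print(estimate, exact)
```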

4.5.3 This integral equals (1/5) E(e7!42’), where Z ~ Exponential(5). 

4.5.5 This sum is approximately equal to eñ E(sin(Z)), where Z ~ Poisson(5). 

4.5.7 (—6.1404, —3.8596) 

4.5.9 (0.354, 0.447) 

4.5.11 (a) C = {Jà fy g(x, y) dx dy}~! (b) Generate X;’s from fy (x) = 3x? for 0 < 
x < laand Y;’s from fy(y) = 4y? for 0 < y < 1. Set D; = sin(X;Y;) cos(./X;¥;) x 
exp(X? + Y;)/12 and N; = X;- Di fori = 1,...,n. 5. Estimate E(X) by Mn = 
N/D= (Ni +++ +Nn)/(Di +--+ Dn). 

4.5.13 (a) J = fy [0° e h(x, y) o,o) (y)e~” dy Io,1y(x) dx (b) Generate X; and Y; 
appropriately, set T; = eř h(X;, Y;), and estimate J by Mn = (Ti +--- + T)/n. 
c) J = ake Jo. e AC, y) It0,00)(y)5e~>” dy Ito, (x) dx (d) As in part (b). (e) The 
estimator having smaller variance is better. So use sample variances to choose between 
them. 


4.6.1 (a) U ~ N(43, 629), V ~ N(—18 — 8C, 144 + 25C?) (b) C = —24/125 

4.6.3 $C_1 = 1/5$, $C_2 = -3$, $C_3 = 1/2$, $C_4 = 7$, $C_5 = 2$ 

4.6.5 Let $Z_1, \ldots, Z_n, W_1, \ldots, W_m \sim N(0, 1)$ be i.i.d. and set $X = (Z_1)^2 + \cdots + (Z_n)^2$ 
and $Y = (W_1)^2 + \cdots + (W_m)^2$. 

4.6.7 $C = \sqrt{n}$ 

4.6.9 $C_1 = 2/5$, $C_2 = -3$, $C_3 = 2$, $C_4 = 7$, $C_5 = 2$, $C_6 = 1$, $C_7 = 1$ 

4.6.11 (a) m = 60, $K = \sqrt{61}$ (b) y = 1.671 (c) a = 61, b = 1, c = 60 (d) w = 4.00 
5.1.1 The mean survival times for the control group and the treatment group are 93.2 
days and 356.2 days, respectively. 


5.1.3 For those who are still alive, their survival times will be longer than the recorded 
values, so these data values are incomplete. 


5.1.5 x = —0.1375 
5.1.7 Use the difference x — y. 
5.2.1 In Example 5.2.1, the mode is 0. In Example 5.2.2, the mode is 1. 


5.2.3 The mixture has density $(0.5/\sqrt{2\pi}) \exp\{-(x + 4)^2/2\} + (0.5/\sqrt{2\pi}) \exp\{-(x - 4)^2/2\}$ 
for $-\infty < x < \infty$. 


5.2.5 x = 10 




5.2.7 The mode is 1/3. 
5.2.9 The mode is x = 0. 


5.3.1 The statistical model for a single response consists of three probability functions 
{Bernoulli(1 /2), Bernoulli(1/3), Bernoulli(2/3)}. 


5.3.3 The sample $(X_1, \ldots, X_n)$ is a sample from an $N(\mu, \sigma^2)$ distribution, where $\theta = 
(\mu, \sigma^2) \in \Omega = \{(10, 2), (8, 3)\}$. Both the population mean and variance uniquely 
identify the population distribution. 


5.3.5 A single observation is from an Exponential($\theta$) distribution, where $\theta \in \Omega = 
[0, \infty)$. We can parameterize this model by the mean or variance but not by the 
coefficient of variation. 


5.3.7 (a) $\Omega = \{A, B\}$ (b) The value X = 1 is observable only when $\theta = A$. (c) Both 
$\theta = A$ and $\theta = B$ are possible. 


5.3.9 P; 
5.4.1 
0 x <l 4 
4 10 as, 
0 l<x<2 3 
10 AT 
Fy(x) = ih 2<x<3 ,fx@)= 2 — 
9 3< 4 10 i 
5 <x< 1 4 
10 x= 
1 4<x 


bx = Die tfx@) = 2,03 = (Lia x@))-P=1 

5.4.3 (a) Yes (b) Use Table D.1 by selecting a row and reading off the first three single 
numbers (treat 0 in the table as a 10). (c) Using row 108 of Table D.1 (treating 0 as 
10): First sample: we obtain random numbers 6, 0, 9, and so compute $(X(\pi_6) + 
X(\pi_{10}) + X(\pi_9))/3 = 3.0$. Second sample: we obtain random numbers 4, 0, 7, and 
so compute $(X(\pi_6) + X(\pi_{10}) + X(\pi_9))/3 = 2.6667$. Third sample: we obtain 
random numbers 2, 0, 2 (note we do not skip the second 2), and so compute $(X(\pi_6) + 
X(\pi_{10}) + X(\pi_9))/3 = 2.0$. 

5.4.5 (c) The shape of a histogram depends on the intervals being used. 

5.4.7 It is a categorical variable. 


5.4.9 (a) Students are more likely to lie when they have illegally downloaded music, so 
the results of the study will be flawed. (b) Under anonymity, students are more likely 
to tell the truth, so there will be less error. (c) The probability a student tells the truth 
is p = 0.625. Let Y; be the answer from student i. Then (Y — (1 — p))/(2p — 1) is 
recorded as an estimate of the proportion of students who have ever downloaded music 
illegally. 

5.5.1 (a) $\hat{f}_X(0) = 0.2667$, $\hat{f}_X(1) = 0.2$, $\hat{f}_X(2) = 0.2667$, $\hat{f}_X(3) = \hat{f}_X(4) = 0.1333$ 
(b) $\hat{F}_X(0) = 0.2667$, $\hat{F}_X(1) = 0.4667$, $\hat{F}_X(2) = 0.7333$, $\hat{F}_X(3) = 0.8667$, $\hat{F}_X(4) = 
1.000$ (d) The mean $\bar{x} = 1.667$ and the variance $s^2 = 1.952$. (e) The median is 2 and 
the IQR = 3. According to the 1.5 IQR rule, there are no outliers. 


5.5.3 (a) fx (1) = 25/82, fx (2) = 35/82, fx (3) = 22/82 (b) No 




5.5.5 The sample median is 0, first quartile is —1.150, third quartile is 0.975, and the 
IQR = 2.125. We estimate Fy (1) by Fy(1) = 17/20 = 0.85. 


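5.5.7 $\psi(\mu) = \mu + \sigma_0 z_{0.25}$, where $z_{0.25}$ satisfies $\Phi(z_{0.25}) = 0.25$ 
5.5.9 $\psi(\mu) = \Phi((3 - \mu)/\sigma_0)$ 
5.5.11 $\psi(\mu, \sigma^2) = \Phi((3 - \mu)/\sigma)$ 
5.5.13 $\psi(\theta) = 2\theta(1 - \theta)$ 
5.5.15 $\psi(\beta) = \alpha_0/\beta^2$ 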
6.1.1 The appropriate statistical model is Binomial($n, \theta$), where $\theta \in \Omega = [0, 1]$ is the 
probability of having this antibody in the blood. The likelihood function is $L(\theta \mid 3) = 
\binom{n}{3} \theta^3 (1 - \theta)^{n-3}$. 

6.1.3 $L(\theta \mid x_1, \ldots, x_{20}) = \theta^{20} \exp(-(20\bar{x})\theta)$ and $\bar{x}$ is a sufficient statistic. 

10) ,/9 

6.1.5 ¢ = (4)/(3) 
6.1.7 $L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^n \theta^{x_i} e^{-\theta}/x_i! = \theta^{n\bar{x}} e^{-n\theta}/\prod x_i!$ and $\bar{x}$ is a minimal 
sufficient statistic. 


6.1.9 L(1 | 0)/L(2 | 0) = 4.4817, the distribution $f_1$ is 4.4817 times more likely than $f_2$. 
6.1.11 No 

6.1.13 No 

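6.2.1 $\hat{\theta}(1) = a$, $\hat{\theta}(2) = b$, $\hat{\theta}(3) = b$, $\hat{\theta}(4) = a$ 

6.2.3 $\psi(\theta) = \theta^2$ is 1-1, and so $\psi(\hat{\theta}(x_1, \ldots, x_n)) = \bar{x}^2$ is the MLE. 

6.2.5 $\hat{\theta} = \alpha_0/\bar{x}$ 

6.2.7 $\hat{\alpha} = -n/\sum_{i=1}^n \ln x_i$ 

6.2.9 $\hat{\theta} = n/\sum_{i=1}^n \ln(1 + x_i)$ 

6.2.11 $\hat{\mu}^3 = 32.768$ cm$^3$ is the MLE 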

6.2.13 A likelihood function cannot take negative values. 

6.2.15 Equivalent log-likelihood functions differ by an additive constant. 
6.3.1 P-value = 0.592 and 0.95-confidence interval is (4.442, 5.318). 
6.3.3 P-value = 0.000 and 0.95-confidence interval is (63.56, 67.94). 


6.3.5 P-value = 0.00034 and 0.95-confidence interval is [47.617, 56.383]. The 
minimum required sample size is 2. 


6.3.7 P-value = 0.1138, so not statistically significant and the observed difference of 
1.05 — 1 = 0.05 is well within the range of practical significance. 


6.3.9 P-value = 0.527 

6.3.11 P-value = 0.014 

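6.3.13 (a) $\sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2$ (b) The plug-in estimator is $\hat{\sigma}^2 = \bar{x}(1 - \bar{x})$, 
so $\hat{\sigma}^2 = s^2(n - 1)/n$. (c) $\mathrm{bias}(\hat{\sigma}^2) = -\sigma^2/n \to 0$ as $n \to \infty$. 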

6.3.15 (a) Yes (b) No 


6.3.17 The P-value 0.22 does not imply the null hypothesis is correct. It may be that 
we have just not taken a large enough sample size to detect a difference. 




6.4.1 m3  z(147)/253/ V/T = (26.027, 151.373) 


6.4.3 The method of moments estimator is $\sqrt{m_2 - m_1^2}/m_1$. If Y = cX, then E(Y) = 
cE(X) and $\mathrm{Var}(Y) = c^2 \mathrm{Var}(X)$. 

6.4.5 From the mgf, my (0) = 307 u+ u?. The plug-in estimator is 23 = 3 (m = mî) x 
mı + m$, while the method of moments estimator of u3 is m3 = 1 DA. 


6.4.7 The sample median is estimated by —0.03 and the estimate of the first quartile is 
—1.28, and for the third quartile is 0.98. Also F (2) = F (1.36) = 0.90. 


6.4.9 The bootstrap procedure is sampling from a discrete distribution and by the CLT 
the distribution of the bootstrap mean is approximately normal when n and m are large. 
The delta theorem justifies the approximate normality of functions of the bootstrap 
mean under conditions. 


6.4.11 The maximum number of possible values is $1 + \binom{n}{2} = 1 + n(n - 1)/2$. Here, 0 
is obtained when i = j. The bootstrap sample range $y_{(n)} - y_{(1)}$ has the largest possible 
value $x_{(n)} - x_{(1)}$ and smallest possible value of 0. If there are many repeated $x_i$ values 
in the bootstrap sample, then the value 0 will occur with high probability for $y_{(n)} - y_{(1)}$ 
and so the bootstrap distribution of the sample range will not be approximately normal. 


6.5.1 n/20* 
6.5.3 n/a? 
6.5.5 2/x + (2/¥) 20.95/V2n = (9.5413 x 1074, 1.5045 x 107°) 


6.5.7 & + (@/J/n)z(14y)/2 = (0.18123, 0.46403) as a 0.95-confidence interval, and 
this does not contain a = 1 + 1/25 = 1.04. 


6.5.9 [0, min(1 + %)7! + n71 A +8)! V8 +) tz, DI 
7.1.1 Based on m(1, 1) = 1/20 + 2/45 + 18/80 = 23/72, m(1, 2) = 1/20 + 4/45 + 
6/80 = 77/360, m(2, 1) = m(1, 2), m(2, 2) = 1/20 + 8/45 + 2/80 = 91/360 the 
posterior probability distributions for each of the four possible samples are as follows: 
sample | (1, 1) | (1, 2) 
$\theta = 1$ | (1/4)(1/5)/m(1, 1) = 18/115 | (1/4)(1/5)/m(1, 2) = 18/77 
$\theta = 2$ | (1/9)(2/5)/m(1, 1) = 16/115 | (2/9)(2/5)/m(1, 2) = 32/77 
$\theta = 3$ | (9/16)(2/5)/m(1, 1) = 81/115 | (3/16)(2/5)/m(1, 2) = 27/77 
sample | (2, 1) | (2, 2) 
$\theta = 1$ | (1/4)(1/5)/m(2, 1) = 18/77 | (1/4)(1/5)/m(2, 2) = 18/91 
$\theta = 2$ | (2/9)(2/5)/m(2, 1) = 32/77 | (4/9)(2/5)/m(2, 2) = 64/91 
$\theta = 3$ | (3/16)(2/5)/m(2, 1) = 27/77 | (1/16)(2/5)/m(2, 2) = 9/91 
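
The posterior table for 7.1.1 can be reproduced with exact arithmetic. In the sketch below, the prior (1/5, 2/5, 2/5) and the per-observation probabilities P(x = 1 | theta) = 1/2, 1/3, 3/4 are inferred from the products shown in the answer and are assumptions, not quotations from the exercise; all names are illustrative.

```python
from fractions import Fraction as F

# Assumed prior on theta and assumed P(x = 1 | theta), inferred from the answer.
prior = {1: F(1, 5), 2: F(2, 5), 3: F(2, 5)}
p_one = {1: F(1, 2), 2: F(1, 3), 3: F(3, 4)}


def likelihood(theta, sample):
    """Probability of an i.i.d. sample with values in {1, 2} given theta."""
    out = F(1)
    for x in sample:
        out *= p_one[theta] if x == 1 else 1 - p_one[theta]
    return out


def posterior(sample):
    """Return the prior predictive m(sample) and the posterior over theta."""
    joint = {t: prior[t] * likelihood(t, sample) for t in prior}
    m = sum(joint.values())
    return m, {t: v / m for t, v in joint.items()}


for sample in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    m, post = posterior(sample)
    print(sample, m, post)
```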
7.1.3 The prior probability that $\theta$ is positive is 0.5, and the posterior probability is 
0.9992. 


DAS OR E Tee oO, gO e dO 

7.1.7 w|o7, x1,...,4n ~ N(S.5353, go’), 1/0? |x1,..., Xn ~ Gamma(11, 41.737) 
7.1.9 (a) $(n + 1)\theta^n I_{[0.4, 0.6]}(\theta)/(0.6^{n+1} - 0.4^{n+1})$ (b) No (c) The prior must be greater 
than 0 on any parameter values that we believe are possible. 

7.1.11 (a) $\Pi(|\theta| = 0) = 1/6$, $\Pi(|\theta| = 1) = 1/3$, $\Pi(|\theta| = 2) = 1/3$, $\Pi(|\theta| = 3) = 
1/6$, so $|\theta|$ is not uniformly distributed on {0, 1, 2, 3}. (b) No 




T(a+fh+n)0 (nx ) 
7.2.1 ACEC 7 = = 


7.2.3 E (1/0? aa mere) = (a0 +n/2)/f,, and the posterior mode is 1/67 = 
(ao +n/2 —1)/B,. 

7.2.5 As in Example 7.2.4, the posterior distribution of 0; is Beta( fı +a 1, f2 +--+ 
fir ta2+---+ax), 80 E (01 [x13 Xn) = TOF Dy ai tart D/A + 


a) 0+ X$; ai +1) and maximizes In((01) ^t“! (1 — 91) Zi=20i+a:)-1) for the 
posterior mode. 


7.2.7 Recall that the posterior distribution of 0; in Example 7.2.2 is Beta( fı +a1, f2 + 
-+-+ fk +a2+--- +a). Find the second moment and use Var(01 | x1, ..., Xn) = 
EO? |x1,--.,Xn)-(E@1 |x, --.,4n))?. Now 0 < fi/n < 1, so Var(01 | x1, ..., Xn) 
= (fi/n+a Èi (fi/n +a) / nA + Die) ai /n + 1/n)+ Diy a/n} > 0 
asn > œo. 

7.2.9 The posterior predictive density of x,+ is obtained by averaging the N (x, (1/ tot 


n/ Ga) oes) density with respect to the posterior density of u, so we must have that 
this is also the posterior predictive distribution. 


7.2.11 The posterior predictive distribution of t = xn+1 is (nx + f9)Pareto(n + ao). 
So the posterior mode is ¢ = 0, the posterior expectation is (nx + Bo)/(n + ao — 1), 
and the posterior variance is (nx + £o)? (n + ao)/[(n + ag — 1)? (n + ao — 2)]. 


7.2.13 (a) The posterior distribution of ø? is inverse Gamma(n/2 + ao, 8,), where 
By = (n — 1)s?/2 +.0% — uo)?/2 + Bo. (b) E(0? | x1, ..., Xn) = By /(n/2 + a0 — 
1). (c) To assess the hypothesis Ho : o? < o$, compute the probability II(1/o* > 
1/03 | x1,- <, Xn) = 1-G(2B,./05; 2a9 +n) where G(-; 2a9 +n) is the y7(2a9+n) 
cdf. 


7.2.15 (a) The odds in favor of A = 1/odds in favor of $A^c$. (b) BF(A) = 1/BF($A^c$) 


7.2.17 Statistician I’s posterior probability for Ho is 0.0099. Statistician II’s posterior 
probability for Ho is 0.0292. Hence, Statistician II has the bigger posterior belief in 
A. 


7.2.19 A Bayes factor in favor of A takes values in $[0, \infty)$. If A has posterior 
probability equal to 0, then the Bayes factor will be 0. 


7.3.1 (3.2052, 4.4448) 

7.3.3 The posterior mode is A = (nt /v6+Lo/o5)/(n/vg+1/05) and 67(x1,...,%n) = 
(n/v§ + 1/o6)7!. Hence, the asymptotic y -credible interval is (@ — za4))/26, À + 
Z(+y)/26)- 

7.3.5 For a sample $(x_1, \ldots, x_n)$, the posterior distribution is $N(\bar{x}, 1/n)$ restricted to 
[0, 1]. A simple Monte Carlo algorithm for the posterior distribution is: 1. Generate $\eta$ 
from $N(\bar{x}, 1/n)$. 2. Accept $\eta$ if it is in [0, 1] and return to step 1 otherwise. If the true 
value $\theta$ is not in [0, 1], then the acceptance rate will be very small for large n. 
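
The accept/reject algorithm described in 7.3.5 can be sketched as follows. The function name, the seed, and the inputs (sample mean 0.9, n = 25) are hypothetical values chosen only to illustrate the procedure; they do not come from the exercise.

```python
import random


def sample_posterior(xbar, n, size, rng):
    """Draw `size` values from N(xbar, 1/n) restricted to [0, 1] by rejection.

    Returns the accepted draws and the observed acceptance rate.
    """
    sd = (1.0 / n) ** 0.5
    draws, proposals = [], 0
    while len(draws) < size:
        eta = rng.gauss(xbar, sd)  # step 1: generate from N(xbar, 1/n)
        proposals += 1
        if 0.0 <= eta <= 1.0:      # step 2: keep only values in [0, 1]
            draws.append(eta)
    return draws, size / proposals


rng = random.Random(2)  # fixed seed for reproducibility
draws, rate = sample_posterior(xbar=0.9, n=25, size=5000, rng=rng)
print(rate, sum(draws) / len(draws))
```

When the sample mean lies far outside [0, 1] relative to the standard deviation, almost every proposal is rejected, which is the inefficiency the answer points out.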


7.4.1 The posterior density is proportional to $\lambda^{n+\alpha-1} \exp\{-\lambda(\ln(\prod (1 + x_i)) + \beta)\}$. 




7.4.3 (a) The maximum value of the prior predictive is obtained when t = 1. (b) The 
posterior of 0 given t = 1 is 


a/2ay3y _ 32 


= 59/1728 — 59 ea 
TAIL = 7 aaa 29 0=b 
59/1728 ~ 59 ee 


7.4.5 The prior predictive is given by ma,g (X1,.--, Xn) = Tor Ties ae D, 
Based on the prior predictive, we would select the prior given by a = 1, 8 = 1. 
7.4.7 Jeffreys' prior is $\sqrt{n}\,\theta^{-1/2}(1 - \theta)^{-1/2}$. The posterior distribution of $\theta$ is 
Beta($n\bar{x} + 1/2$, $n(1 - \bar{x}) + 1/2$). 

7.4.9 The prior distribution is $\theta \sim N(66, \sigma^2)$ with $\sigma^2 = 101.86$. 

7.4.11 Let the prior be $\theta \sim \text{Exponential}(\lambda)$ with $\lambda = 0.092103$. 

8.1.1 L(1|-) = (3/2) L(2 | -), so by Section 6.1.1 T is a sufficient statistic. The condi- 
tional distributions of s are as follows. 


s=1 s= s=3 s=4 ] 
= I yi2 T76 I 
fa (8 |T = 1) HRT I6 = 3 0 0 
_ a: 23i] 
f(s |T = 1) T2143 Tea 3 0 0 


fa@|T=3)| 0 0 I 7 
fsIT=3)] 0 0 1 0 
s=1 s=2 s=3 s=4 

fasIT=4)| 0 0 0 l 
fos|T=4) | 0 0 0 J 

8.1.3 2+ (1 — 1/n) o? 

8.1.5 UMVU for 5 + 2u 

8.1.7 x/ao 

8.1.9 n7! OP 11) (Xi) 

8.1.11 Yes 


8.2.1 When $\alpha = 0.1$, $c_0 = 3/2$ and $\gamma = ((1/10) - (1/12))/(1/2) = 1/30$. The 
power of the test is 23/120. When $\alpha = 0.05$, $c_0 = 2$ and $\gamma = ((1/20) - 0)/(1/12) = 
3/5$. The power of the test is 1/10. 


8.2.3 By (8.2.6) the optimal 0.01 test is of the form 


1 ī>1+ 322.3263 1 ī > 2.0404 


po (xX) = = 


0 z<1+ X2 2.3263 0 


8.2.5 (a) 0 (b) Suppose 0 > 1. The power function is £ (0) = 1 — 1/8. 
8.2.7 n > 4 


< 2.0404. 


bai] 




8.2.9 The graph of the power function of the UMP size $\alpha$ test function lies above the 
graph of the power function of any other size $\alpha$ test function. 

8.3.1 $\Pi(\theta = 1 \mid 2) = 2/5$, $\Pi(\theta = 2 \mid 2) = 3/5$, so $\Pi(\theta = 2 \mid 2) > \Pi(\theta = 1 \mid 2)$ and 
we accept $H_0 : \theta = 2$. 

8.3.3 The Bayes rule is given by (lice + n/o5) | (uo/T + nx /o5), which converges 
tox as To > &. 

8.3.5 The Bayes rule is given by (nao + To) / (nx + vo) and by the weak law of large 
numbers this converges in probability to 8 as n —> oo. 


8.3.7 The Bayes rule rejects whenever 


z 2 
exp (- (x — uo) ) 
1 A ıfı a afk, 
= n Jt i n Lo nz ae FO nx“ 
To (4+3) exp 1(4+4) (4 +43) i(4 E =) 


is less than (1 — po)/po. As to — oo, the denominator converges to 0, so in the limit 
we never reject Ho. 

8.4.1 The model is given by the collection of probability functions $\{\theta^{n\bar{x}}(1 - \theta)^{n - n\bar{x}} : 
\theta \in [0, 1]\}$ on the set of all sequences $(x_1, \ldots, x_n)$ of 0's and 1's. The action space is 
A = [0, 1], the correct action function is $A(\theta) = \theta$, and the loss function is $L(\theta, a) = 
(\theta - a)^2$. The risk function for T is $R_T(\theta) = \mathrm{Var}_\theta(\bar{x}) = \theta(1 - \theta)/n$. 

8.4.3 The model is the set of densities $\{(2\pi\sigma_0^2)^{-n/2} \exp\{-\sum_{i=1}^n (x_i - \mu)^2/2\sigma_0^2\} : \mu \in 
R^1\}$ on $R^n$. The action space is $A = R^1$, the correct action function is $A(\mu) = \mu$, 
and the loss function is $L(\mu, a) = (\mu - a)^2$. The risk function for T is $R_T(\mu) = 
\mathrm{Var}_\mu(\bar{x}) = \sigma_0^2/n$. 

8.4.5 (a) R_d(a) = 1/2, R_d(b) = 3/4 (b) No. Consider the risk function of the decision
function d* given by d*(1) = b, d*(2) = a, d*(3) = b, d*(4) = a.

9.1.1 The observed discrepancy statistic is given by D (r) = 22.761 and the P-value is 
P (D(R) > 22.761) = 0.248, which doesn’t suggest evidence against the model. 


9.1.3 (c) The plots suggest that the normal assumption seems reasonable. 


9.1.5 The observed counts are given in the following table. 


Interval      Count
(0.0, 0.2]    4
(0.2, 0.4]    7
(0.4, 0.6]    3
(0.6, 0.8]    4
(0.8, 1]      2


The chi-squared statistic is equal to 3.50 and the P-value is given by (X² ~ χ²(4))
P(X² ≥ 3.5) = 0.4779. Therefore, we have no evidence against the Uniform model
being correct. 
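
The arithmetic in 9.1.5 can be checked with a short sketch. It assumes n = 20 observations with counts 4, 7, 3, 4, 2 and equal expected counts of 4 per interval under the Uniform model; the function names are illustrative, and the closed-form tail probability used here holds only for 4 degrees of freedom.

```python
import math

def chi_squared_statistic(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over the cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf_4df(x):
    """P(X^2 > x) for X^2 ~ chi-squared(4); this closed form is for 4 df only."""
    return math.exp(-x / 2) * (1 + x / 2)

observed = [4, 7, 3, 4, 2]                 # assumed counts, n = 20
expected = [sum(observed) / 5] * 5         # Uniform model: equal expected counts
stat = chi_squared_statistic(observed, expected)
p_value = chi2_sf_4df(stat)
print(round(stat, 2), round(p_value, 4))
```

For other degrees of freedom a general chi-squared tail routine would be needed in place of `chi2_sf_4df`.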

9.1.7 (a) The probability of the event s = 3 is 0 based on the probability measure P 
having S as its support. The most appropriate P-value is 0. (b) 0.729 




9.1.9 No 
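9.1.11 (a) The conditional probability function of (x₁, ..., xₙ) is

$$\frac{\theta^{n\bar{x}}(1-\theta)^{n-n\bar{x}}}{\binom{n}{n\bar{x}}\,\theta^{n\bar{x}}(1-\theta)^{n-n\bar{x}}} = 1\Big/\binom{n}{n\bar{x}}.$$

(b) Hypergeometric(n, ⌊n/2⌋, n x̄₀) (c) 0.0476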
9.2.1 (a) No (b) P-value is 1/10, so there is little evidence of a prior-data conflict. (c)
P-value is 1/300, so there is some evidence of a prior-data conflict.
9.2.3 We can write x̄ = μ + z, where z ~ N(0, σ₀²/n) is independent of μ ~
N(μ₀, τ₀²).
9.2.5 The P-value for checking prior-data conflict is 0. Hence, there is definitely a
prior-data conflict.
10.1.1 For any x₁, x₂ (that occur with positive probability) and y, we have P(Y =
y | X = x₁) = P(Y = y | X = x₂). Thus P(X = x₁, Y = y) = P(X = x₂, Y =
y)P(X = x₁)/P(X = x₂), and summing this over x₁ leads to P(X = x₂, Y = y) =
P(X = x₂)P(Y = y). For the converse, show P(Y = y | X = x) = P(Y = y).
10.1.3 X and Y are related. 
10.1.5 The conditional distributions P (Y = y | X = x) will change with x whenever 
X is not degenerate. 


10.1.7 If the conditional distribution of life-length given various smoking habits changes, 
then we can conclude that these two variables are related, but we cannot conclude that 
this relationship is a cause-effect relationship due to the possible existence of con- 
founding variables. 


10.1.9 The researcher should draw a random sample from the population of voters 
and ask them to measure their attitude toward a particular political party on a scale 
from favorably disposed to unfavorably disposed. Then the researcher should randomly 
select half of this sample to be exposed to a negative ad, while the other half is exposed 
to a positive ad. They should all then be asked to measure their attitude toward the 
particular political party on the same scale. Next compare the conditional distribution 
of the response variable Y (the change in attitude from before seeing the ad to after), 
given the predictor X (type of ad exposed to), using the samples to make inference 
about these distributions. 


10.1.11 (a) {(0, 100) , (1, 100)} (b) A sample has not been taken from the population 
of interest. The individuals involved in the study have volunteered and, as a group, 
they might be very different from the full population. (c) We should group the individuals
according to their initial weight W into homogeneous groups (blocks) and then
randomly apply the treatments to the individuals in each block. 


10.1.13 (a) The response variable could be the number of times an individual has 
watched the program. A suitable predictor variable is whether or not they received 
the brochure. (b) Yes, as we have controlled the assignment of the predictor variable. 
10.1.15 W has a relationship with Y and X has a relationship with Y. 


10.1.17 (a)

             X = 0   X = 1 | Sum
Rel. Freq.   0.5     0.5   | 1.0

(b)

             Y = 0   Y = 1 | Sum
Rel. Freq.   0.7     0.3   | 1.0

(c)

Rel. Freq. | X = 0   X = 1 | Sum
Y = 0        0.3     0.4   | 0.7
Y = 1        0.2     0.1   | 0.3
Sum          0.5     0.5   | 1.0

(d)

P(Y = y | X = x) | y = 0   y = 1 | Sum
x = 0              0.6     0.4   | 1.0
x = 1              0.8     0.2   | 1.0

(e) Yes
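
The conditional table in 10.1.17(d) follows from the joint relative frequencies in part (c) by dividing each cell by the marginal for X. A minimal sketch, assuming the joint table is stored as a dictionary keyed by (x, y); the helper names are illustrative:

```python
# Joint relative frequencies from part (c); keys are (x, y).
joint = {(0, 0): 0.3, (1, 0): 0.4, (0, 1): 0.2, (1, 1): 0.1}

def marginal_x(joint):
    """Marginal relative frequencies of X, obtained by summing over y."""
    m = {}
    for (x, _y), p in joint.items():
        m[x] = m.get(x, 0.0) + p
    return m

def conditional_y_given_x(joint):
    """P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)."""
    px = marginal_x(joint)
    return {(x, y): p / px[x] for (x, y), p in joint.items()}

cond = conditional_y_given_x(joint)
# X and Y are related when the conditional distribution changes with x.
related = abs(cond[(0, 0)] - cond[(1, 0)]) > 1e-9
```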


10.1.19 X and Y are related. We see that only the variance of the conditional distribu- 
tion changes as we change X. 

10.1.21 The correlation is 0, but X and Y are related. 

10.2.1 The chi-squared statistic is equal to X₀² = 5.7143 and, with X² ~ χ²(2), the P-value
equals P(X² > 5.7143) = 0.05743. Therefore, we don’t have evidence against
the null hypothesis of no difference in the distributions of thunderstorms between the 
two years, at least at the 0.05 level. 

10.2.3 The chi-squared statistic is equal to X₀² = 0.10409 and, with X² ~ χ²(1), the
P-value equals P(X² > 0.10409) = 0.74698. Therefore, we have no evidence against
the null hypothesis of no relationship between the two digits. 

10.2.5 (a) The chi-squared statistic is equal to X₀² = 10.4674 and, with X² ~ χ²(4),
the P-value equals P(X² > 10.4674) = 0.03325. Therefore, we have some evidence
against the null hypothesis of no relationship between hair color and gender. (c) The 
standardized residuals are given in the following table. They all look reasonable, so 
nothing stands out as an explanation of why the model of independence does not fit. 
Overall, it looks like a large sample size has detected a small difference. 


        Y = fair    Y = red     Y = medium   Y = dark    Y = jet black
X = m | −1.07303    0.20785     1.05934      −0.63250    1.73407
X = f | 1.16452     −0.22557    −1.14966     0.68642     −1.88191


10.2.7 We should first generate a value for X₁ ~ Dirichlet(1, 3). Then generate U₂
from the Beta(1, 2) distribution and set X₂ = (1 − X₁)U₂. Next generate U₃ from the
Beta(1, 1) distribution and set X₃ = (1 − X₁ − X₂)U₃. Finally, set X₄ = 1 − X₁ −
X₂ − X₃.
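
The sequential scheme in 10.2.7 can be sketched as follows. This is an illustration only: it treats the first step as a draw from Beta(1, 3) (the first coordinate of a two-parameter Dirichlet(1, 3)), and it assumes the target is the Dirichlet(1, 1, 1, 1) distribution, which is what these Beta parameters produce. The function name is hypothetical.

```python
import random

def sample_dirichlet_1111(rng):
    """One draw of (X1, X2, X3, X4) via the sequential Beta construction."""
    x1 = rng.betavariate(1, 3)            # X1 ~ Beta(1, 3)
    u2 = rng.betavariate(1, 2)            # U2 ~ Beta(1, 2)
    x2 = (1 - x1) * u2
    u3 = rng.betavariate(1, 1)            # U3 ~ Beta(1, 1), i.e., Uniform[0, 1]
    x3 = (1 - x1 - x2) * u3
    x4 = 1 - x1 - x2 - x3                 # remainder, so the coordinates sum to 1
    return (x1, x2, x3, x4)

rng = random.Random(0)                    # fixed seed for reproducibility
samples = [sample_dirichlet_1111(rng) for _ in range(10000)]
means = [sum(s[k] for s in samples) / len(samples) for k in range(4)]
```

Under the stated assumption each coordinate has mean 1/4, so the entries of `means` should all be close to 0.25.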




10.2.9 Then there are 36 possible pairs (i, j) for i, j = 1, ..., 6. Let fᵢⱼ denote
the frequency for (i, j) and compute the chi-squared statistic, X² = Σᵢ₌₁⁶ Σⱼ₌₁⁶ (fᵢⱼ −
fᵢ· f·ⱼ/n)²/(fᵢ· f·ⱼ/n). Compute the P-value P(χ²(25) > X²).

10.2.11 We look at the differences |fᵢⱼ − fᵢ· f·ⱼ/n| to see how big these are.

10.3.1 x̄

10.3.3 x̄

10.3.5 (b) y = 29.9991 +2.10236x (e) The plot of the standardized residuals against X 
indicates very clearly that there is a problem with this model. (f) Based on part (e), it is 
not appropriate to calculate confidence intervals for the intercept and slope. (g) Nothing 
can be concluded about the relationship between Y and X based on this model, as we 
have determined that it is inappropriate. (h) R² = 486.193/7842.01 = 0.062, which
is very low. 

10.3.7 (b) b₂ = 1.9860 and b₁ = 58.9090

(d) The standardized residual of the ninth week departs from the other residuals in part
(c). This provides some evidence that the model is not correct. (e) The confidence interval
for β₁ is [44.0545, 72.1283], and the confidence interval for β₂ is [0.0787, 3.8933].
(f) The ANOVA table is as follows.


Source | Df   Sum of Squares   Mean Square
X      | 1    564.0280         564.0280
Error  | 10   1047.9720        104.7972
Total  | 11   1612.0000


So the F-statistic is F = 5.3821 and P(F(1, 10) > 5.3821) < 0.05 from Table
D.5. Hence, we conclude there is evidence against the null hypothesis of no linear
relationship between the response and the predictor. (g) R² = 0.3499, so almost 35%
of the observed variation in the response is explained by changes in the predictor.
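
The F-statistic and R² in 10.3.7 follow directly from the ANOVA table entries. A small sketch of that arithmetic, using the sums of squares and degrees of freedom from the table (variable names are illustrative):

```python
rss = 564.0280      # regression sum of squares (source X), 1 df
ess = 1047.9720     # error sum of squares, 10 df
df_reg, df_err = 1, 10

mean_square_reg = rss / df_reg
mean_square_err = ess / df_err
f_stat = mean_square_reg / mean_square_err   # ratio of mean squares
r_squared = rss / (rss + ess)                # proportion of variation explained

print(round(f_stat, 4), round(r_squared, 4))
```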
10.3.9 In general, E(Y | X) = exp(β₁ + β₂X) is not a simple linear regression model
since it cannot be written in the form E(Y | X) = β₁* + β₂*V, where V is an observed
variable and the βᵢ* are unobserved parameter values.

10.3.11 We can write E(Y | X) = E(Y | X²) in this case and E(Y | X²) = β₁ + β₂X²,
so this is a simple linear regression model but the predictor is X², not X.

10.3.13 R² = 0.05 indicates that the linear model explains only 5% of the variation in
the response, so the model will not have much predictive power. 

10.4.1 (b) Both plots look reasonable, indicating no serious concerns about the correctness
of the model assumptions. (c) The ANOVA table for testing H₀ : β₁ = β₂ = β₃
is given below.


Source | Df   SS      MS
A      | 2    4.37    2.18
Error  | 9    18.85   2.09
Total  | 11   23.22


The F statistic for testing H₀ is given by F = 2.18/2.09 = 1.0431, with P-value
P(F > 1.0431) = 0.39135. Therefore, we don’t have evidence against the null hypothesis
of no difference among the conditional means of Y given X. (d) Since we
did not find any relationship between Y and X, there is no need to calculate these
confidence intervals.

10.4.3 (b) Both plots indicate a possible problem with the model assumptions. (c) The
ANOVA table for testing H₀ : β₁ = β₂ is given below.


Source | Df   SS       MS
Cheese | 1    0.114    0.114
Error  | 10   26.865   2.686
Total  | 11   26.979


The F statistic for testing H₀ is given by F = 0.114/2.686 = 0.04, with P-value
P(F > 0.04) = 0.841. Therefore, we do not have any evidence against the null
hypothesis of no difference among the conditional means of Y given Cheese.

10.4.5 (b) Both plots look reasonable, indicating no concerns about the correctness of 
the model assumptions. (c) The ANOVA table for testing H₀ : β₁ = β₂ = β₃ = β₄
follows.


Source Df SS MS 
Treatment | 3 19.241 6.414 
Error 20 11.788 0.589 
Total 23 31.030 


The F statistic for testing H₀ is given by F = 6.414/0.589 = 10.89, with P-value
P(F > 10.89) = 0.00019. Therefore, we have strong evidence against the null
hypothesis of no difference among the conditional means of Y, given the predictor. (d) 
The 0.95-confidence intervals for the difference (column level mean)-(row level mean) 
between the means are given in the following table. 


      1                     2                     3
2     (−0.3913, 1.4580)
3     (−2.2746, −0.4254)    (−2.8080, −0.9587)
4     (−2.5246, −0.6754)    (−3.0580, −1.2087)    (−1.1746, 0.6746)


10.4.7 (b) Treating the marks as separate samples, the ANOVA table for testing any 
difference between the mean mark in Calculus and the mean mark in Statistics follows. 


Source | Df SS MS 

Course | 1 36.45 36.45 
Error | 18 685.30 38.07 
Total | 19 721.75 


The F statistic for testing H₀ : β₁ = β₂ is given by F = 36.45/38.07 = 0.95745,
with the P-value equal to P(F > 0.95745) = 0.3408. Therefore, we do not have any 
evidence against the null hypothesis of no difference among the conditional means of 
Y given Course. 

Both residual plots look reasonable, indicating no concerns about the correctness 
of the model assumptions. (c) Treating these data as repeated measures, the mean 
difference between the mark in Calculus and the mark in Statistics is given by d̄ =
−2.7 with standard deviation s = 2.00250. The P-value for testing H₀ : μ₁ = μ₂
is 0.0021, so we have strong evidence against the null. Hence, we conclude that there
is a difference between the mean mark in Calculus and the mean mark in Statistics.
A normal probability plot of the data does not indicate any reason to doubt model
assumptions. (d) r_xy = 0.944155

10.4.9 When Y₁ and Y₂ are measured on the same individual, we have that Var(Y₁ −
Y₂) = 2(Var(Y₁) − Cov(Y₁, Y₂)) > 2Var(Y₁) since Cov(Y₁, Y₂) < 0. If we had measured
Y₁ and Y₂ on independently randomly selected individuals, then we would have
that Var(Y₁ − Y₂) = 2Var(Y₁).

10.4.11 The difference of the two responses Y₁ and Y₂ is normally distributed, i.e.,
Y₁ − Y₂ ~ N(μ, σ²).

10.4.13 (1) The conditional distribution of Y, given (X₁, X₂), depends on (X₁, X₂)
only through E(Y | X₁, X₂), and the error Z = Y − E(Y | X₁, X₂) is independent of
(X₁, X₂). (2) The error Z = Y − E(Y | X₁, X₂) is normally distributed. (3) X₁ and X₂
do not interact.

10.5.1 $F(x) = \int_{-\infty}^{x} e^{-t}(1 + e^{-t})^{-2}\,dt = (1 + e^{-t})^{-1}\big|_{-\infty}^{x} = (1 + e^{-x})^{-1} \to 1$ as
x → ∞, and p = F(x) implies x = ln(p/(1 − p)).

10.5.3 Let l = l(p) = ln(p/(1 − p)) be the log odds, so eˡ = p/(1 − p) = 1/(1/p − 1).
Hence, eˡ/(1 + eˡ) = p, and substitute l = β₁ + β₂x.
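
The relationship between the logistic cdf and the log odds used in 10.5.1 and 10.5.3 can be checked numerically. In this sketch the function names and the coefficient values `beta1` and `beta2` are hypothetical, chosen only for illustration:

```python
import math

def logistic_cdf(x):
    """F(x) = 1 / (1 + e^{-x}), the logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Log odds ln(p / (1 - p)), the inverse of the logistic cdf."""
    return math.log(p / (1.0 - p))

# Round trip: logit(F(x)) should recover x.
xs = [-3.0, -0.5, 0.0, 1.25, 4.0]
round_trip = [logit(logistic_cdf(x)) for x in xs]

# Hypothetical coefficients: P(Y = 1 | X = x) = e^l / (1 + e^l) with l = beta1 + beta2 * x.
beta1, beta2 = -1.0, 0.5
x = 2.0
l = beta1 + beta2 * x
p = math.exp(l) / (1.0 + math.exp(l))
```

Here `l` equals 0, so `p` is 1/2, and `p` agrees with `logistic_cdf(l)`.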

10.5.5 P(Y = 1 | X₁ = x₁, ..., X_k = x_k) = 1/2 + arctan(β₁x₁ + ··· + β_k x_k)/π

11.1.1 (a) 0 (b) 0 (c) 1/3 (d) 2/3 (e) 0 (f) 4/9 (g) 0 (h) 1/9 (i) 0 (j) 0 (k) 0.00925 (l) 0
(m) 0.0987

11.1.3 (a) 5/108 (b) 5/216 (c) 5/72 (d) By the law of total probability, P(X₃ = 8) =
P(X₁ = 6, X₃ = 8) + P(X₁ = 8, X₃ = 8).

11.1.5 (a) Here, P(τ_c < τ₀) = 0.89819. That is, if you start with $9 and repeatedly
make $1 bets having probability 0.499 of winning each bet, then the probability you
will reach $10 before going broke is equal to 0.89819. (b) 0.881065 (c) 0.664169 (d)
0.0183155 (e) 4 × 10⁻¹⁸ (f) 2 × 10⁻¹⁷⁴
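
The value in 11.1.5(a) can be reproduced with the standard gambler's ruin formula for a simple random walk, P(reach c before 0 | start at a) = (1 − (q/p)^a)/(1 − (q/p)^c) when p ≠ 1/2, with q = 1 − p. The sketch below assumes this is the formula behind the stated answer; the function name is illustrative.

```python
def prob_reach_target(a, c, p):
    """Probability that a walk started at a hits c before 0, winning each bet w.p. p."""
    if p == 0.5:
        return a / c                      # fair-game case
    r = (1 - p) / p                       # ratio q/p
    return (1 - r ** a) / (1 - r ** c)

# Start with $9, target $10, win probability 0.499 per $1 bet.
answer_a = prob_reach_target(9, 10, 0.499)
print(round(answer_a, 5))
```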

11.1.7 We use Theorem 11.1.1. (a) 1/4 (b) 3/4 (c) 0.0625 (d) 1/4 (e) 0 (f) 1 (g) We
know that the initial fortune is 5, so to get to 7 in two steps, the walk must have been 
at 6 after the first step. 

11.1.9 (a) 18/38 (b) 0.72299 (c) 0.46056 (d) 0 (e) In the long run, the gambler loses 
money. 

11.2.1 (a) 0.7 (b) 0.1 (c) 0.2 (d) 1/4 (e) 1/4 (f) 1/2 (g) 0.3

11.2.3 (a) P₀(X₂ = 0) = 0.28, P₀(X₂ = 1) = 0.72, P₁(X₂ = 0) = 0.27, P₁(X₂ =
1) = 0.73 (b) P₀(X₃ = 1) = 0.728

11.2.5 (a) 1/2 (b) 0 (c) 1/2 (d) 1/2 (e) 1/10 (f) 2/5 (g) 37/100 (h) 11/20 (i) 0 (j) 0 (k)
0 (l) 0 (m) No

11.2.7 This chain is doubly stochastic, i.e., has Σᵢ pᵢⱼ = 1 for all j. Hence, we must
have the uniform distribution (π₁ = π₂ = π₃ = π₄ = 1/4) as a stationary distribution.


11.2.9 (a) By either increasing or decreasing one step at a time, we see that for all
i and j, we have pᵢⱼ⁽ⁿ⁾ > 0 for some n ≤ d. (b) Each state has period 2. (c) If i
and j are two or more apart, then pᵢⱼ = pⱼᵢ = 0. If j = i + 1, then πᵢpᵢⱼ =
(1/2ᵈ)((d − 1)!/(i!(d − i − 1)!)), while πⱼpⱼᵢ = (1/2ᵈ)((d − 1)!/(i!(d − i − 1)!)).
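
The reversibility identity in 11.2.9(c) can be verified with exact arithmetic. The sketch below assumes the usual Ehrenfest urn setup, which is not restated in this answer: states 0, ..., d, transition probabilities p(i, i+1) = (d − i)/d and p(i+1, i) = (i + 1)/d, and πᵢ = C(d, i)/2ᵈ. The function name is illustrative.

```python
from fractions import Fraction
from math import comb, factorial

def ehrenfest_detailed_balance(d):
    """Check pi_i p_{i,i+1} == pi_{i+1} p_{i+1,i} for i = 0, ..., d - 1."""
    pi = [Fraction(comb(d, i), 2 ** d) for i in range(d + 1)]
    for i in range(d):
        up = Fraction(d - i, d)           # assumed p_{i,i+1}
        down = Fraction(i + 1, d)         # assumed p_{i+1,i}
        # Common value stated in the answer: (1/2^d) (d-1)! / (i! (d-i-1)!)
        stated = Fraction(factorial(d - 1),
                          factorial(i) * factorial(d - i - 1) * 2 ** d)
        if not (pi[i] * up == stated == pi[i + 1] * down):
            return False
    return True

checks = [ehrenfest_detailed_balance(d) for d in range(1, 12)]
```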
11.2.11 (a) This chain is irreducible. (b) The chain is aperiodic. (c) π₁ = 2/9, π₂ =
3/9, π₃ = 4/9 (d) limₙ→∞ Pᵢ(Xₙ = 2) = π₂ = 3/9 = 1/3, so Pᵢ(X₅₀₀ = 2) ≈ 1/3.

11.2.13 P₁(X₁ + X₂ > 5) = 0.54

11.2.15 (a) The chain is irreducible. (b) The chain is not aperiodic. 

11.3.1 First, choose any initial value X₀. Then, given Xₙ = i, let Yₙ₊₁ = i + 1 or i − 1,
with probability 1/2 each. Let j = Yₙ₊₁ and let $\alpha_{ij} = \min(1, e^{-(j-13)^4 + (i-13)^4})$.
Then let Xₙ₊₁ = j with probability αᵢⱼ, otherwise let Xₙ₊₁ = i with probability
1 − αᵢⱼ.


11.3.3 First, choose any initial value X₀. Then, given Xₙ = i, let Yₙ₊₁ = i + 1
with probability 7/9 or Yₙ₊₁ = i − 1 with probability 2/9. Let j = Yₙ₊₁ and, if
j = i + 1, let $\alpha_{ij} = \min(1, e^{-j^4-j^6-j^8}(2/9)/e^{-i^4-i^6-i^8}(7/9))$ or, if j = i − 1,
then let $\alpha_{ij} = \min(1, e^{-j^4-j^6-j^8}(7/9)/e^{-i^4-i^6-i^8}(2/9))$. Then let Xₙ₊₁ = j with
probability αᵢⱼ, otherwise let Xₙ₊₁ = i with probability 1 − αᵢⱼ.

11.3.5 Let {Zₙ} be i.i.d. ~ N(0, 1). First, choose any initial value X₀. Then, given
Xₙ = x, let Yₙ₊₁ = Xₙ + √10 Zₙ₊₁. Let y = Yₙ₊₁ and let α_xy = min(1, exp{−y⁴ −
y⁶ − y⁸ + x⁴ + x⁶ + x⁸}). Then let Xₙ₊₁ = y with probability α_xy, otherwise let
Xₙ₊₁ = x with probability 1 − α_xy.
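
The algorithm described in 11.3.5 can be sketched in a few lines. This is an illustration under stated assumptions: the target density is taken to be proportional to exp(−x⁴ − x⁶ − x⁸), as implied by the acceptance probability, the proposal is a normal increment with variance 10, and the function names, starting value, and seed are arbitrary choices. The acceptance test is done on the log scale to avoid overflow.

```python
import math
import random

def log_target(x):
    """Log of the assumed unnormalized target density exp(-x^4 - x^6 - x^8)."""
    return -(x ** 4) - x ** 6 - x ** 8

def metropolis(n_steps, x0=0.0, proposal_var=10.0, seed=0):
    """Metropolis chain with symmetric proposal Y = X + sqrt(proposal_var) * Z."""
    rng = random.Random(seed)
    sd = math.sqrt(proposal_var)
    x = x0
    chain = []
    for _ in range(n_steps):
        y = x + sd * rng.gauss(0.0, 1.0)                    # proposed value
        log_alpha = min(0.0, log_target(y) - log_target(x)) # log acceptance probability
        u = 1.0 - rng.random()                              # uniform on (0, 1]
        if math.log(u) <= log_alpha:
            x = y                                           # accept the proposal
        chain.append(x)                                     # otherwise stay at x
    return chain

chain = metropolis(20000)
chain_mean = sum(chain) / len(chain)
```

Because the assumed target is symmetric about 0, `chain_mean` should be near 0 for a long run.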

11.4.1 C = 12/5 

11.4.3 p = 1/3 

11.4.5 P(X, = 4) = 5/8 

11.4.7 (a) Here, E(Xₙ₊₁ | Xₙ) = (1/4)(3Xₙ) + (3/4)(Xₙ/3) = Xₙ. (b) T is nonnegative,
integer-valued, and does not look into the future, so it is a stopping time. (c)
E(X_T) = X₀ = 27. (d) P(X_T = 1) = 27/40
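
The answer to 11.4.7(d) follows from E(X_T) = X₀ once the two possible stopping values are known. The sketch below assumes that T is the first time the chain hits 1 or 81; that pair is inferred from the stated answers rather than quoted from the exercise, and the function name is illustrative.

```python
from fractions import Fraction

def prob_stop_at_low(x0, low, high):
    """Solve p * low + (1 - p) * high = x0 for p = P(X_T = low)."""
    return Fraction(high - x0, high - low)

# Martingale check for one step: (1/4)(3x) + (3/4)(x/3) equals x.
x = Fraction(27)
one_step_mean = Fraction(1, 4) * (3 * x) + Fraction(3, 4) * (x / 3)

p_low = prob_stop_at_low(27, 1, 81)   # assumed stopping values 1 and 81
```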

11.5.1 (a) 1/2 (b) 0 (c) 1/4 (d) We have P(Y( > 1) = PO > J/M//M). 
Hence, P(Y{” > 1) = 1/2, PY > 1) = 1/4, PY® > 1) =3/8, PY > 1) = 
5/16. 

11.5.3 (a) P(B₂ > 1) = Φ(−1/√2) (b) P(B₃ < −4) = Φ(−4/√3) (c) P(B₉ − B₅ ≤
2.4) = 1 − Φ(−2.4/2) (d) P(By − By > 9.8) = Φ(−9.8/√15) (e) P(B_{26.3} ≤
−6) = Φ(−6/√26.3) (f) P(B_{26.3} < 0) = Φ(0/√26.3) = 1/2

11.5.5 E(B₁₃B₈) = 8

11.5.7 (a) 3/4 (b) 1/4 (c) The answer in part (a) is larger because −5 is closer to B₀ = 0
than 15 is, whereas —15 is farther than 5 is. (d) 1/4 (e) We have 3/4 + 1/4 = 1, which 
it must since the events in parts (a) and (d) are complementary events. 

11.5.9 E(X3X5) = 61.75 

11.5.11 (a) P(X₁₀ > 250) = Φ(−20/(1√10)) (b) P(X₁₀ > 250) = Φ(−20/(4√10)) (c)
P(X₁₀ > 250) = Φ(−20/(10√10)) (d) P(X₁₀ > 250) = Φ(−20/(100√10))

11.6.1 (a) $e^{-14}14^{13}/13!$ (b) $e^{-35}35^{3}/3!$ (c) $e^{-42}42^{20}/20!$ (d) $e^{-350}350^{340}/340!$ (e) 0
(f) $(e^{-14}14^{13}/13!)(e^{-21}21^{7}/7!)$ (g) 0

11.6.3 $P(N_2 = 6) = e^{-2/3}(2/3)^6/6!$, $P(N_3 = 5) = e^{-3/3}(3/3)^5/5!$




11.6.5 P(N₂.₆ = 2 | N₂.₉ = 2) = (2.6/2.9)²
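
The Poisson process answers in 11.6.3 and 11.6.5 can be checked numerically. This sketch assumes an intensity of 1/3 (inferred from the answers to 11.6.3) and uses independence of increments for the conditional probability; the function name is illustrative.

```python
import math

def poisson_pmf(k, mean):
    """P(N = k) for N ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

rate = 1 / 3                               # assumed intensity of the process
p_n2_is_6 = poisson_pmf(6, rate * 2)       # N_2 ~ Poisson(2/3)
p_n3_is_5 = poisson_pmf(5, rate * 3)       # N_3 ~ Poisson(1)

# P(N_2.6 = 2 | N_2.9 = 2): both events by time 2.6 and none in (2.6, 2.9].
joint = poisson_pmf(2, rate * 2.6) * poisson_pmf(0, rate * (2.9 - 2.6))
conditional = joint / poisson_pmf(2, rate * 2.9)
```

The conditional probability does not depend on the intensity, which is why the answer to 11.6.5 is simply (2.6/2.9)².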


Index 


0-1 loss function, 467

a priori, 374
abs command, 685
absolutely continuous
    jointly, 85
absolutely continuous random variable, 52
acceptance probability, 644
action space, 464
additive, 5
additive model, 593
adjacent values, 287
admissible, 470
admissibility, 470
alternative hypothesis, 448
analysis of covariance, 595
analysis of variance (ANOVA), 545
analysis of variance (ANOVA) table, 545
ancillary, 481
anova command, 690
ANOVA (analysis of variance), 545
ANOVA test, 548
aov command, 690
aperiodicity, 635

balance, 520
ball, 626
bar chart, 288
barplot command, 688
basis, 559
batching, 414
Bayes factor, 397
Bayes risk, 471
Bayes rule, 460, 471
Bayes’ Theorem, 22
Bayesian decision theory, 471
Bayesian model, 374
Bayesian P-value, 395
Bayesian updating, 383
bell-shaped curve, 56
Bernoulli distribution, 42, 131
best-fitting line, 542
beta command, 686
beta distribution, 61
beta function, 61
bias, 271, 322
binom command, 686
binomial coefficient, 17
binomial distribution, 43, 116, 131, 162, 163, 167
binomial theorem, 131
birthday problem, 19
bivariate normal distribution, 89
blinding, 521
blocking variable, 523, 594
blocks, 523
bootstrap mean, 353
bootstrap percentile confidence interval, 355
bootstrap samples, 353
bootstrap standard error, 353
bootstrap t confidence interval, 355
bootstrapping, 351, 353
Borel subset, 38
boxplot, 287
boxplot command, 688
Brown, R., 657
Brownian motion, 657, 659
    properties, 659
Buffon’s needle, 234
burn-in, 643

calculus, 675
    fundamental theorem of, 676
categorical variable, 270




Cauchy distribution, 61, 240 

Cauchy–Schwartz inequality, 186, 197

cause-effect relationship, 516 

cbind command, 698 

cdf (cumulative distribution function), 62 
inverse, 120 

ceiling command, 685 

ceiling (least integer function), 295 

census, 271 

central limit theorem, 215, 247 

chain rule, 676 

characteristic function, 169 

Chebychev’s inequality, 185 

Chebychev, P. L., 2 

chi-squared distribution, 236 

chi-squared goodness of fit test, 490, 491 

chi-squared statistic, 491 

chi-squared(n), 236 

χ²(n), 236

chisq command, 686 

chisq.test command, 688 

classification problems, 267 

coefficient of determination (R²), 546

coefficient of variation, 267, 360 

combinatorics, 15 

complement, 7, 10 

complement of B in A, 7 

complementing function, 315 

completely crossed, 522 

completeness, 438 

composite hypothesis, 466 

composition, 676 

conditional density, 96 

conditional distribution, 94, 96 

conditional expectation, 173 

conditional probability, 20 

conditional probability function, 95 

conditionally independent, 184 

confidence interval, 326 

confidence level, 326 

confidence property, 326 

confidence region, 290 

confounds, 517 

conjugate prior, 422 

consistency, 325 

consistent, 200, 325 


constant random variable, 42 
continuity properties, 28 
continuous random variable, 51–53
continuous-time stochastic process, 658, 
666 
control, 254 
control treatment, 521 
convergence 
    almost surely, 208
in distribution, 213 
in probability, 204, 206, 210, 211, 246 
with probability 1, 208, 210, 211, 246 
convolution, 113 
correct action function, 464 
correction for continuity, 219, 358 
correlation, 89, 156 
cos command, 685 
countably additive, 5 
counts, 102 
covariance, 152, 153 
covariance inequality, 440 
Cramer–Rao inequality, 441
Cramer–von Mises test, 495
craps (game), 27 
credible interval, 391 
credible region, 290 
critical value, 446, 448 
cross, 587 
cross-ratio, 537 
cross-tabulation, 687 
cross-validation, 495 
cumulative distribution function 
inverse, 120 
joint, 80 
properties of, 63 
cumulative distribution function (cdf), 62 


data reduction, 303 
decision function, 467 
decision theory model, 464 
decreasing sequence, 28 
default prior, 425 
degenerate distribution, 42, 131 
delta theorem, 351 
density
    conditional, 96
    proposal, 646
density function, 52
    joint, 85
density histogram function, 274
derivative, 675
    partial, 678
descriptive statistics, 282
design, 519
det command, 698
diag command, 698
diffusions, 661
Dirichlet distribution, 93
discrepancy statistic, 480
discrete random variable, 41
disjoint, 5
distribution
    F, 241
    t, 239
    Bernoulli, 42, 131
    beta, 61
    binomial, 43, 116, 131, 162, 163, 167
    Cauchy, 61
    chi-squared, 236
    conditional, 94, 96
    degenerate, 42, 131
    exponential, 54, 61, 142, 165, 166
    extreme value, 61
    gamma, 55, 116
    geometric, 43, 132
    hypergeometric, 47
    joint, 80
    Laplace, 61
    log-normal, 79
    logistic, 61
    mixture, 68
    negative-binomial, 44, 116
    normal, 57, 89, 116, 142, 145, 234
    Pareto, 61
    point, 42
    Poisson, 45, 132, 162, 164
    proposal, 644
    standard normal, 57
    stationary, 629
    uniform, 7, 53, 141, 142
    Weibull, 61
distribution function, 62
    joint, 80
    properties of, 63
distribution of a random variable, 38
distribution-free, 349
Doob, J., 2, 37
double ’til you win, 618
double blind, 521
double expectation, 177
double use of the data, 507
doubly stochastic matrix, 631, 632
drift, 662
dummy variables, 578

ecdf command, 688
Ehrenfest’s urn, 625
empirical Bayesian methods, 423
empirical distribution function, 271
empty set, 5
error sum of squares (ESS), 545
error term, 516
ESS (error sum of squares), 545
estimation, 290
estimation, decision theory, 465
estimator, 224, 434
event, 5
exact size α test function, 449
exp command, 686
expectation, 173
expected value, 129, 130, 141, 191
    linearity of, 135, 144, 192
    monotonicity of, 137, 146, 192
experiment, 518
experimental design, 520
experimental units, 519
exponential distribution, 54, 142, 165, 166
    memoryless property of, 61
extrapolation, 543
extreme value distribution, 61

F, 62
f command, 686
F distribution, 241
F_X, 62
F_X(a⁻), 63
f_X, 59
F(m, n), 241




F-statistic, 546 
factor, 519 
factorial, 16, 677 
fair game, 617 
family error rate, 581 
Feller, W., 2, 37 
Fermat, P. de, 2 
finite population correction factor, 280 
first hitting time, 618 
Fisher information, 365 
Fisher information matrix, 372 
Fisher signed deviation statistic, 363 
Fisher’s exact test, 484 
Fisher’s multiple comparison test, 581 
fitted values, 560 
floor command, 685 
floor (greatest integer function), 119 
for command, 691 
fortune, 615 
frequentist methods, 374 
frog, 626, 641 
function 
Lipschitz, 665 
fundamental theorem of calculus, 676 


gambler’s ruin, 618 
gambling, 615 
gambling strategy 
double ’til you win, 618 
gamma command (function), 685 
gamma command (distribution), 686 
gamma distribution, 55, 116 
gamma function, 55 
γ-confidence interval, 326
generalized hypergeometric distribution, 51 
generalized likelihood ratio tests, 455 
generating function 
characteristic function, 169 
moment, 165 
probability, 162 
geom command, 686 
geometric distribution, 43, 132 
geometric mean, 200 
Gibbs sampler, 647 
Gibbs sampling, 413 
glm command, 690 


greatest integer function (floor), 119 
grouping, 274 


Hall, Monty, 27, 28 

hierarchical Bayes, 424 

higher-order transition probabilities, 628 

highest posterior density (HPD) intervals, 
392 

hist command, 687 

hitting time, 618 

HPD (highest posterior density) intervals, 
392 

hyper command, 686 

hypergeometric distribution, 47 

hyperparameter, 422 

hyperprior, 424 

hypothesis assessment, 290, 332 

hypothesis testing, 446 

hypothesis testing, decision theory, 466 


i.i.d. (independent and identically distributed), 101

identity matrix, 560 

if command, 691 

importance sampler, 233 

importance sampling, 233 

improper prior, 425 

inclusion–exclusion, principle of, 12, 14

increasing sequence, 28 

independence, 24, 98, 101, 137 

pairwise, 24 


independent and identically distributed (i.i.d.), 101

indicator function, 35, 210 
indicator function, expectation, 131 
individual error rate, 581 
inequality 

    Cauchy–Schwartz, 186, 197

Chebychev’s, 185 

Jensen’s, 187 

Markov’s, 185 
inference, 258 
infinite series, 677 
information inequality, 441 
initial distribution, 623 
integral, 676 


intensity, 666 

interaction, 522, 587 

intercept term, 562 

interpolation, 543 

interquartile range (IQR), 286 

intersection, 7 

inverse cdf, 120 

inverse Gamma, 380 

inverse normalizing constant, 376 

inversion method for generating random 
variables, 121 

IQR (interquartile range), 286 

irreducibility, 634 


Jacobian, 110 

Jeffreys’ prior, 426 

Jensen’s inequality, 187 

joint cumulative distribution function, 80 
joint density function, 85 

joint distribution, 80 

jointly absolutely continuous, 85 


k-th cumulant, 173 
Kolmogorov, A. N., 2 
Kolmogorov–Smirnov test, 495
kurtosis statistic, 483 


Laplace distribution, 61 
large numbers 

law of, 206, 211 
largest-order statistic, 104 
latent variables, 414, 415 
law of large numbers 

strong, 211 

weak, 206 
law of total probability, 11, 21 
least integer function (ceiling), 295 
least relative surprise estimate, 406
least-squares estimate, 538, 560 
least-squares line, 542 
least-squares method, 538 
least-squares principle, 538 
Lehmann-Scheffé theorem, 438 
length command, 685 
levels, 520 
lgamma command, 685 




likelihood, 298 

likelihood function, 298 
likelihood principle, 299 
likelihood ratios, 298 
likelihood region, 300 

Likert scale, 279 

linear independence property, 559 
linear regression model, 558 
linear subspace, 559 

linearity of expected value, 135, 144, 192 
link function, 603 

Lipschitz function, 665 

lm command, 689

location, 136 

location mixture, 69 

log command, 685 

log odds, 603 

log-gamma function, 383 
log-likelihood function, 310 
log-normal distribution, 79 
logistic distribution, 61, 606 
logistic link, 603 

logistic regression model, 603 
logit, 603 

loss function, 465 

lower limit, 287 

ls command, 686

lurking variables, 518 


macro, 700 

MAD (mean absolute deviation), 469 
margin of error, 329 

marginal distribution, 82 

Markov chain, 122, 623 

Markov chain Monte Carlo, 643 
Markov’s inequality, 185 

Markov, A. A., 2 

martingale, 650 

matrix, 559, 678 

matrix inverse, 560 

matrix product, 560 

max command, 685 

maximum likelihood estimator, 308
maximum likelihood estimate (MLE), 308 
maximum of random variables, 104 
mean command, 688 




mean absolute deviation (MAD), 469 
mean value, 129, 130 
mean-squared error (MSE), 321, 434, 469 
measurement, 270 
measuring surprise (P-value), 332 
median, 284 
median command, 688 
memoryless property, 61 
Méré, C. de, 2 
method of composition, 125 
method of least squares, 538 
method of moments, 349 
method of moments principle, 350 
method of transformations, 496 
Metropolis–Hastings algorithm, 644
min command, 685 
minimal sufficient statistic, 304 
minimax decision function, 471 
Minitab, 699 
mixture distribution, 68 

location, 69 

scale, 70 
MLE (maximum likelihood estimate), 308 
mode of a density, 260 
model checking, 266, 479 
model formula, 688 
model selection, 464 
moment, 164 
moment-generating function, 165 
monotonicity of expected value, 137, 146, 192

monotonicity of probabilities, 11 
Monte Carlo approximation, 225 
Monty Hall problem, 27, 28 
MSE (mean-squared error), 321, 434, 469 
multicollinearity, 515 
multinomial coefficient, 18 
multinomial distributions, 102 
multinomial models, 302, 305 
multiple comparisons, 510, 581 
multiple correlation coefficient, 565 
multiplication formula, 21 
multiplication principle, 15 
multivariate measurement, 271 
multivariate normal, 500 


N(0, 1), 57

N(μ, σ²), 57

NA (not available in R), 686 

nbinom command, 686 

ncol command, 698 

negative-binomial distribution, 44, 116 

Neyman–Pearson theorem, 450

noninformative prior, 425 

nonrandomized decision function, 467 

nonresponse error, 277 

norm command, 686 

normal distribution, 57, 89, 116, 142, 145, 
234 

normal probability calculations, 66 

normal probability plot, 488 

normal quantile plot, 488 

normal score, 488 

nrow command, 698 

nuisance parameter, 338 

null hypothesis, 332 


observational study, 269 
observed Fisher information, 364 
observed relative surprise, 406 
odds in favor, 397 

one-sided confidence intervals, 347 
one-sided hypotheses, 347 
one-sided tests, 337 

one-to-one function, 110 
one-way ANOVA, 577 

optimal decision function, 470 
optimal estimator, 434 

optional stopping theorem, 653 
order statistics, 103, 284 

ordered partitions, 17 

orthogonal rows, 236 

outcome, 4 

outliers, 288 

overfitting, 481 


Px, 42 

P-value, 332 

paired comparisons, 585 
pairwise independence, 24 
parameter, 262 

parameter space, 262 


Pareto distribution, 61 

partial derivative, 678 

partition, 11 

Pascal’s triangle, 2, 632 

Pascal, B., 2 

pen, 626 

percentile, 283 

period of Markov chain, 635 
permutations, 16 

Φ, 66

Pᵢ, 627

placebo effect, 521 

plot command, 688 

plug-in Fisher information, 366 
plug-in MLE, 315 

point distribution, 42 

point hypothesis, 466 

point mass, 42 

pois command, 686 

Poisson distribution, 45, 132, 162, 164 
Poisson process, 50, 666 

polling, 276 

pooling, 593 

population, 270 

population cumulative distribution, 270
population distribution, 270 
population interquartile range, 286 
population mean, 285 

population relative frequency function, 274 
population variance, 285 

posterior density, 376 

posterior distribution, 376 

posterior mode, 387 

posterior odds, 397 

posterior predictive, 400 

posterior probability function, 376 
power, 341 

power function, 341, 449, 469 

power transformations, 496 

practical significance, 335 

prediction, 258, 400 

prediction intervals, 576 

prediction region, 402 

predictor variable, 514 

principle of conditional probability, 259 
principle of inclusion–exclusion, 12, 14




prior elicitation, 422 
prior odds, 397 
prior predictive distribution, 375 
prior probability distribution, 374 
prior risk, 471 
prior-data conflict, 503
probability, 1 
conditional, 20 
law of total, 11, 21 
probability function, 42 
conditional, 95 
probability measure, 5 
probability model, 5 
probability plot, 488 
probability-generating function, 162 
probit link, 603 
problem of statistical inference, 290 
process 
Poisson, 666 
random, 615 
stochastic, 615 
proportional stratified sampling, 281 
proposal density, 646 
proposal distribution, 644 
pseudorandom numbers, 2, 117 
pth percentile, 283 
pth quantile, 283 


q command, 683 

qqnorm command, 688 
quantile, 283 

quantile command, 688 
quantile function, 120 
quantiles, 284 

quantitative variable, 270 
quartiles, 284 

queues, 50 

quintile, 362 


Rt, 265 
R? (coefficient of determination), 546 
random numbers, 710 
random process, 615 
random variable, 34, 104 
absolutely continuous, 52 
constant, 42 




    continuous, 51–53
discrete, 41 
distribution, 80 
expected value, 129, 130, 141, 191 
mean, 130 
standard deviation, 150 
unbounded, 36 
variance, 149 
random walk, 615, 616 
on circle, 625 
randomization test, 363 
randomized block design, 594 
rank command, 688 
Rao–Blackwell theorem, 436
Rao–Blackwellization, 436
rate command, 686 
rbind command, 698 
reduction principles, 470 
reference prior, 425 
regression assumption, 515 
regression coefficients, 541 
regression model, 516, 540 
regression sum of squares (RSS), 545 
reject, 448 
rejection region, 448 
rejection sampling, 122, 125 
related variables, 513 
relative frequency, 2 
γ-relative surprise region, 406
rep command, 684 
reparameterization, 265 
reparameterize, 309 
repeated measures, 584 
resamples, 353 
resampling, 353 
residual plot, 486 
residuals, 481, 560 
response, 4 
response curves, 588 
response variable, 514 
reversibility, 632 
right-continuous, 74 
risk, 3 
risk function, 467 
rm command, 686 
RSS (regression sum of squares), 545 


sample, 101 
sample command, 687 
sample α-trimmed mean, 355 
sample average, 206 
sample correlation coefficient, 190, 547 
sample covariance, 190, 547 
sample interquartile range (IQR), 287 
sample mean, 206, 266 
sample median, 284 
sample moments, 350 
sample pth quantile, 284 
sample range, 361 
sample space, 4 
sample standard deviation, 286 
sample variance, 221, 266, 286 
sample-size calculation, 273, 340 
sampling 
    importance, 233 
    Monte Carlo, 122 
    rejection, 122, 125 
sampling distribution, 199 
sampling study, 273 
sampling with replacement, 48 
sampling without replacement, 47, 48 
scale mixture, 70 
scan command, 684 
scatter plot, 542, 551 
score equation, 310 
score function, 310 
sd command, 688 
seed values, 492 
selection effect, 271 
series 
    Taylor, 677 
series, infinite, 677 
set.seed command, 687 
sign statistic, 357 
sign test statistic, 357 
simple hypothesis, 466 
simple linear regression model, 540 
simple random sampling, 271, 272 
simple random walk, 615, 616 
Simpson’s paradox, 183 
sin command, 685 
size α rejection region, 448 
size α test function, 449 
skewed 
skewness, 286 
skewness statistic, 483 
SLLN (strong law of large numbers), 211 
smallest-order statistic, 104 
solve command, 698 
sort command, 688 
source command, 691 
sqrt command, 685 
squared error loss, 466 
St. Petersburg paradox, 133, 134, 141 
standard bivariate normal density, 89 
standard deviation, 150 
standard error, 221, 325 
standard error of the estimate, 323 
standard normal distribution, 57 
standardizing a random variable, 215 
state space, 623 
stationary distribution, 629 
statistical inference, 262 
statistical model, 262 
statistical model for a sample, 263 
statistically significant, 335 
stochastic matrix, 624 
    doubly, 631, 632 
stochastic process, 615 
    continuous-time, 658, 666 
    martingale, 650 
stock prices, 662 
stopping theorem, 653 
stopping time, 652 
stratified sampling, 281 
strength of a relationship, 513 
strong law of large numbers (SLLN), 211 
Student(n), 239 
subadditivity, 12 
subfair game, 617 
sufficient statistic, 302 
sugar pill, 521 
sum command, 685 
summary command, 689 
superfair game, 617 
surprise (P-value), 332 
survey sampling, 276 




t command, 686 
t distribution, 239 
t(n), 239 
t-confidence intervals, 331 
t-statistic, 331 
t-test, 337 
t.test command, 688 
table command, 687 
tables 
    binomial probabilities, 724 
    χ² quantiles, 713 
    F distribution quantiles, 715 
    random numbers, 710 
    standard normal cdf, 712 
    t distribution quantiles, 714 
tail probability, 259 
tan command, 685 
Taylor series, 677 
test function, 449, 469 
test of hypothesis, 332 
test of significance, 332 
theorem of total expectation, 177 
total expectation, theorem of, 177 
total probability, law of, 11, 21 
total sum of squares, 544 
training set, 495 
transition probabilities, 623 
    higher-order, 628 
transpose, 560 
treatment, 520 
two-sample t-confidence interval, 580 
two-sample t-statistic, 580 
two-sample t-test, 580 
two-sided tests, 337 
two-stage systems, 22 
two-way ANOVA, 586 
type I error, 448 
type II error, 448 
types of inferences, 289 


UMA (uniformly most accurate), 460 

UMP (uniformly most powerful), 449 

UMVU (uniformly minimum variance unbiased), 437 

unbiased, 437 

unbiased estimator, 322, 436 




unbiasedness, hypothesis testing, 453 

unbounded random variable, 36 

underfitting, 481 

unif command, 686 

uniform distribution, 7, 53, 141, 142 

uniformly minimum variance unbiased (UMVU), 437 

uniformly most accurate (UMA), 460 

uniformly most powerful (UMP), 449 

union, 8 

upper limit, 287 

utility function, 134, 141 

utility theory, 134, 141 


validation set, 495 

var command, 688 

variance, 149 

variance stabilizing transformations, 362 
Venn diagrams, 7 

volatility parameter, 662 

von Savant, M., 28 


weak law of large numbers (WLLN), 206 
Weibull distribution, 61 

whiskers, 287 

Wiener process, 657, 659 

Wiener, N., 2, 657 

WLLN (weak law of large numbers), 206 


z-confidence intervals, 328 
z-statistic, 328 
z-test, 333 


Biography 


Michael J. Evans 

A fundamental concern of a theory of statistical 
inference is how statistical evidence is to be 
measured. A main theme of my research is 
concerned with providing such a definition, 
together with the theory of inference that 
follows from this and with its application to 
solve particular statistical problems. Another 
theme of my research is concerned with 
developing methodology to resolve issues 
inherent in subjective choices in statistics 
through elicitation, controlling for induced 
biases, model checking, and checking for 
prior-data conflict. 


Jeffrey S. Rosenthal 

Jeffrey Seth Rosenthal FRSC 
FIMS (born October 13, 1967) 
is a Canadian statistician and 
nonfiction author. He is a 
professor in the University of 
Toronto's department of 
statistics, cross-appointed 
with its department of 
mathematics. 


